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4  Introduction 

Breast  cancer  is  a  major  cause  of  death  among  women  over  the  age  of  forty  [1].  Mam¬ 
mography  is  the  most  effective  diagnostic  procedure  for  the  early  detection  of  breast  can¬ 
cer  [2, 3].  Mammography  is  not,  however,  perfect.  Between  10-30%  of  women  who  have 
breast  cancer  and  undergo  mammography  have  negative  mammograms  [4-7].  Of  these,  ra¬ 
diologists  have  determined,  retrospectively,  that  two-thirds  of  the  cancers  could  have  been 
detected  [5, 6, 8,  9].  One  possible  means  by  which  to  decrease  this  number  is  to  have  two 
radiologists  read  the  mammograms.  This  method  has  been  shown  to  increase  sensitivity  by 
as  much  as  15%,  [10, 11]  but  can  be  costly  both  financially  and  with  respect  to  time.  A 
computer-aided  diagnostic  scheme  may  act  as  an  inexpensive  second  reading  method.  The 
final  decision  would  be  made  by  the  radiologist. 

This  research  sought  to  answer  questions  that  arise  when  using  pattern  classifiers  in  deci¬ 
sion  making  applications.  Problems  occur  when  the  number  of  inputs  to  the  pattern  classifier 
become  large.  For  this  reason,  genetic  algorithms  and  other  feature  selection  techniques  were 
studied  to  alleviate  this  problem.  The  purpose  of  this  research  was  to  study  and  develop 
feature  selection  and  pattern  classification  methods  to  improve  the  performance  of  CAD 
schemes.  Specific  emphasis  was  placed  on  using  the  developed  methods  in  the  computerized 
detection  of  mass  lesions  in  mammography. 
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5  Body 

5.1  Technical  Objectives 

The  objectives  of  this  project  were  as  follows: 

•  Development  of  a  genetic  algorithm  for  the  optimization  of  artificial  neural  network 
inputs. 

•  Comparison  of  a  genetic  algorithm  with  other  selection  and  optimization  methods 
including  previously  used  selection  methods. 

•  Analysis  of  features  selected  by  genetic  algorithm  and  comparison  of  those  features 
with  visual  techniques  employed  by  radiologists. 

•  Development  of  a  parallel  genetic  algorithm  to  improve  performance  of  the  search  and 
to  provide  an  even  greater  performance  increase  to  the  mass  detection  CAD  program. 

5.2  Methods 

5.3  Development  of  Classification  Methods 

We  have  performed  extensive  analysis  of  two  methods  of  pattern  classification.  First,  we 
studied  the  ability  of  a  classifier  called  a  Bayesian  ANN  to  approximate  the  ideal  observer 
(the  theoretically  optimal  classifier)  given  datasets  of  limited  size.  A  full  analysis  of  this  is 
given  in  reference  [12]  which  is  attached  as  Appendix  A. 

We  also  developed  a  novel  classifier  training  strategy  which  directly  optimizes  the  ROC 
curve  of  a  given  classifier  on  a  given  training  dataset.  This  approach  to  classifier  training 
was  found  to  be  useful  for  “simple”  classifiers  such  as  rule-based  classifiers.  A  full  analysis 
of  this  method  is  given  in  reference  [13]  which  is  attached  as  Appendix  B. 

We  have  applied  various  feature  selection  methods  to  select  subsets  of  features  for  both 
of  these  classifiers  using  data  extracted  from  a  database  177  screening  mammography  cases. 
Validation  results  were  computed  and  used  to  compare  the  various  classifiers  and  methods 
of  feature  selection  as  is  discussed  later. 

5.3.1  Development  of  Feature  Selection  Methods 

We  have  studied  various  feature  selection  methods  for  selecting  features  to  be  used  within 
an  artificial  neural  network  (ANN)  classifier  and  methods  for  selecting  features  for  rule-based 
classifiers.  A  detailed  summary  of  these  methods  can  be  found  in  references  [14]  which  is 
also  attached  as  Appendix  C. 

We  also  endeavored  to  use  these  methods  to  select  features  for  use  in  an  actual  comput¬ 
erized  detection  system.  A  summary  of  these  results  is  shown  in  Appendix  D  which  is  a 
paper  under  review  at  Medical  Physics.  In  this  paper,  it  is  shown  that  the  Bayesian  ANN 
outperforms  a  rule-based  classifier  trained  using  the  multiobjective  approach.  Furthermore, 
both  genetic  algorithm  feature  selection  and  a  forward  selection  method  performed  best  for 
selecting  features  for  Bayesian  ANNs. 
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5.3.2  Analysis  of  Feature  Selection 

During  the  course  of  our  research  on  pattern  classification  and  feature  selection,  we 
studied  in  more  detail,  some  of  the  fundamental  problems  associated  with  feature  selection 
[15].  A  paper  on  this  subject  is  attached  as  Appendix  E.  It  was  found  that  the  ability  of 
a  feature  selection  method  to  select  the  optimal  subset  of  features  decreases  as  the  number 
of  feature  from  which  to  select  increases  and  as  the  database  used  to  select  the  features 
decreases  in  size.  For  a  simple  feature  selection  problem',  we  derived  the  probability  that  the 
optimal  subset  of  features  will  be  selected. 

5.3.3  Analysis  of  Features 

A  detailed  analysis  of  the  features  selected  by  the  various  feature  selection  methods  is 
also  contained  in  Appendix  D.  The  features  we  have  found  to  be  most  useful  in  distinguishing 
between  malignant  mass  lesions  and  false  candidates  in  mammography  were  gradient-based 
features  such  as  RGI  [16]  and  intensity-based  measures  such  as  the  contrast.  The  geometric 
features  were  found  to  be  useful  but  not  to  the  extent  shown  in  previous  studies  [17].  This  is 
due  to  a  novel  lesion  segmentation  technique  that  we  have  developed  for  the  mass  detection 
CAD  system  which  always  returns  lesion-like  results  even  in  non-mass  image  areas  [16]. 

5.3.4  Development  of  a  Parallel  Genetic  Algorithm 

A  group  at  Argonne  National  labs  has  developed  a  parallel  genetic  algorithm  package 
called  the  PGAPack  [18].  This  package  was  found  to  be  suitable  for  the  implementation  of 
GA  feature  selection. 
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6  Key  Research  Accomplishments 

•  Development  of  an  RGI-based  lesion  segmentation  method  which  outperforms  previ¬ 
ously  studied  lesion  segmentation  methods. 

Development  of  a  model-based  probabilistic  segmentation  method  which  outperforms 
previously  studied  lesion  segmentation  methods. 

•  Analyzed  the  ability  of  Bayesian  ANNs  to  approximate  the  ideal  observer  given  simu¬ 
lated  datasets.  These  results  aided  in  the  design  of  Bayesian  ANN  classifiers  for  CAD 
systems. 

•  Development  of  a  multiobjective  classifier  training  methodology  which  directly  opti¬ 
mized  the  ROC  curve  of  a  classifier.  This  approach  has  been  found  to  be  very  useful 
for  rule-based  classifiers. 

•  Analysis  of  the  fundamentals  of  the  feature  selection  problem  which  helps  one  better 
understand  the  difficulties  associated  with  feature  selection. 

•  Development  of  a  genetic  algorithm  feature  selection  methodology. 

•  Analysis  and  comparisons  of  the  various  feature  selection  methods  which  will  aid  in 
the  selection  of  an  appropriate  feature  selection  algorithm. 

•  A  study  of  Bayesian  ANNs,  the  multiobjective  approach,  and  the  various  feature  se¬ 
lection  methods  for  the  computerized  detection  of  mass  lesions  in  mammography. 

•  A  mass  CAD  systems  that  has  both  a  simpler  and  more  understandable  methodology 
and  a  better  overall  performance. 
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8  Conclusions 

We  have  studied  some  of  the  fundamental  properties  of  feature  selection.  We  have  found 
that  the  probability  of  selecting  an  optimal  subset  of  features  rapidly  decreases  as  the  sample 
size  decreases  and  the  total  number  of  features  from  which  to  select  increases.  Understanding 
the  limitation  of  feature  selection  will  help  us  select  (using  methods  such  as  ID  analysis  and 
genetic  algorithms)  a  useful  and  robust  subset  of  features  to  be  used  in  the  computerized 
detection  of  mass  lesions  in  mammography. 

We  have  also  studied  the  use  of  Bayesian  artificial  neural  networks  in  classification  tasks. 
We  have  found  that  Bayesian  ANNs  produce  more  accurate  and,  yet,  robust  solutions  to 
classification  problems.  Bayesian  ANNs  also  train  more  rapidly  than  do  conventional  ANNs 
using  round-robin  methodology.  This  information  will  be  used  design  more  accurate  and 
robust  pattern  classifiers  for  the  computerized  detection  of  mass  lesions  in  mammography. 

We  have  developed  a  novel  classifier  training  approach  known  as  the  multiobjective  ap¬ 
proach.  This  method  has  the  advantage  that  it  directly  optimizes  the  ROC  curve  of  a 
classifier  and,  thus,  returns  the  optimal  ROC  curve  that  can  be  obtained  using  the  give 
classifier  on  the  given  training  dataset. 

We  have  introduced  a  new  initial  filtering  scheme  to  detect  mass  lesions  in  mammog¬ 
raphy.  The  performance  of  feature  selection  methods  and  of  pattern  classifiers  is  limited 
by  the  performance  of  the  initial  detection  algorithm.  We  have  shown  that  RGI  filtering 
substantially  outperformed  the  previous  method  of  detecting  mass  lesions  known  as  bilateral 
subtraction. 

Finally,  we  have  applied  all  of  the  above  techniques  to  the  computerized  detection  of 
mass  lesions  in  mammography  and  produced  a  CAD  system  that  is  both  simpler  and  has  a 
better  overall  performance  over  previous  techniques. 
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Abstract 

It  is  well  understood  that  the  optimal  classification  decision  variable  is  the  likelihood  ratio  or  any 
monotonic  transformation  of  the  likelihood  ratio.  An  automated  classifier  which  maps  from  an  input 
space  to  one  of  the  likelihood  ratio  family  of  decision  variables  is  an  optimal  classifier  or  ‘‘ideal  observer.” 
Artificial  neural  networks  ( ANNs)  are  frequently  used  as  classifiers  for  many  problems.  In  the  limit  of  large 
training  sample  sizes,  an  ANN  approximates  a  mapping  function  which  is  a  monotonic  transformation 
of  the  likelihood  ratio,  i.e.,  it  estimates  an  ideal  observer  decision  variable.  A  principal  disadvantage  of 
conventional  ANNs  is  the  potential  over-parameterization  of  the  mapping  function  which  results  in  a  poor 
approximation  of  an  optimal  mapping  function  for  smaller  training  samples.  Recently,  Bayesian  methods 
have  been  applied  to  ANNs  in  order  to  regularize  training  to  improve  the  robustness  of  the  classifier.  A 
Bayesian  ANN  should  thus  better  approximate  the  optimal  decision  variable  given  small  sample  sizes.  We 
have  evaluated  the  accuracy  of  Bayesian  ANN  models  of  ideal  observer  decision  variables  as  a  function 
of  the  number  of  hidden  units  used,  the  signal-to-noise  ratio  of  the  data,  and  the  number  of  features  or 
dimensionality  of  the  data.  We  show  that  when  enough  training  data  are  present,  excess  hidden  units 
do  not  substantially  degrade  the  accuracy  of  Bayesian  ANNs.  However,  the  minimum  number  of  hidden 
units  required  to  best  model  the  optimal  mapping  function  varies  with  the  complexity  of  the  data. 

Keywords 

Bayesian  neural  networks,  ideal  observers,  ROC  analysis,  computer-aided  diagnosis 


I.  Introduction 

In  image  analysis  and  computer-aided  diagnosis,  automated  classifiers  are  often  em¬ 
ployed  to  determine  whether  a  region  within  an  image  is  abnormal  (with  a  specified 
disease)  or  normal  (without  that  disease)  [1-3].  Typically,  features  are  extracted  from 
a  suspicious  site  to  form  a  vector  of  input  features  which  is  projected  or  mapped  onto  a 
scalar  decision  variable  using  the  parameters  of  the  classifier.  A  threshold  is  then  applied 
to  this  decision  variable  to  determine  whether  the  input  features  are  representative  of  an 
abnormal  or  normal  region.  Before  any  classification  can  be  performed,  the  parameters  of 
the  classifier  must  be  determined.  Classifier  “training”  involves  using  a  dataset  of  obser¬ 
vations  or  features  from  both  the  normal  class  and  the  abnormal  class  to  determine  the 
parameters  of  the  classifier  so  it  will  perform  acceptably  on  future  datasets  of  unknown 
pathology. 

It  is  well  understood  that  the  optimal  classification  decision  variable  is  the  likelihood 
ratio  or  any  monotonic  transformation  of  the  likelihood  ratio  [4-6].  The  ROC  curve  [4,7—9] 
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produced  using  one  of  the  likelihood  ratio  family  of  decision  variables  is  the  optimal  ROC 
curve  which  best  describes  the  limiting  tradeoffs  between  sensitivity  and  specificity.  Any 
system  that  uses  the  likelihood  ratio  or  a  monotonic  transformation  of  the  likelihood  ratio 
to  make  decisions  is  known  as  an  ideal  observer  [10]. 

A  classification  artificial  neural  network  (ANN)  can  be  viewed  as  a  highly  parameter¬ 
ized  classifier  or  mapping  function  [11—14].  The  output  of  an  ANN  in  the  limit  of  large 
sample  sizes  approximates  a  mapping  function  which  is  a  monotonic  transformation  of 
the  likelihood  ratio  [12,15].  *.e.,  it  approximates  an  ideal  observer  decision  variable.  A 
principal  disadvantage  of  conventional  ANNs  is  the  possible  over-parameterization  of  the 
mapping  function,  which  results  in  a  poor  approximation  of  the  optimal  mapping  func¬ 
tion  given  small  training  sample  sizes.  Recently,  Bayesian  methods  have  been  applied  to 
ANNs  [16, 17]  in  order  to  regularize  training  to  improve  the  robustness  of  the  classifier.  A 
Bayesian  ANN  should  thus  better  approximate  the  optimal  decision  variable  given  small 
sample  sizes.  Many  researchers  have  analyzed  the  performance  of  classification  ANNs  by 
evaluating  classification  ability  through  ROC  analysis  on  training  and  testing  datasets 
[18. 19].  Bayesian  ANNs,  however,  are  trained  not  only  to  produce  an  ideal  observer  ROC 
curve,  but  also  to  produce  a  particular  ideal  observer  mapping  function.  Hence,  an  eval¬ 
uation  of  a  Bayesian  ANN’s  ability  to  approximate  this  particular  mapping  function  is  of 
fundamental  importance. 

In  this  paper,  we  investigate  the  accuracy  of  Bayesian  ANN  models  of  an  ideal  observer 
decision  variable  given  simulated  datasets  of  various  sizes  and  neural  networks  with  a 
single  hidden  layer.  Section  II  contains  a  brief  introduction  to  the  optimal  classifier  decision 
variable,  Bayesian  ANNs,  and  the  connection  between  the  two.  Section  III  examines  a  one¬ 
dimensional  example  in  detail  to  provide  insights  regarding  Bayesian  ANNs.  Section  IV 
describes  the  methods  used  for  implementing  a  Bayesian  ANN  and  evaluating  its  accuracy. 
In  Section  V,  we  study  the  effects  of  sample  size,  input  dimension,  number  of  weights,  and 
signal-to-noise  ratio  of  the  data  on  the  accuracy  of  Bayesian  ANNs.  Finally,  Sections 
VI  and  VII  provide  a  discussion  of  the  results  and  a  summary  of  the  advantages  and 
disadvantages  of  Bayesian  ANNs  for  classification. 
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II.  Background 

A.  Optimal  Decision  Variables  and  ROC  Analysis 
The  task  of  an  automated  two-group  classifier  is  to  determine  whether  an  observation 
comes  from  the  abnormal  class  denoted  by  na  or  the  complementary  normal  class  denoted 
by  7 r„.  The  features  corresponding  to  an  observation  can  be  expressed  as  a  D-dimensional 
random  vector  x  =  [xlt  x2, .  •  • ,  xD],  where  boldface  type  denotes  random  variables.  Pro¬ 
jection  classifiers,  such  as  ANNs,  classify  by  mapping  an  observation  onto  a  new  random 
variable  y  =  g{x),  where  g(-)  is  the  mapping  function.  An  observation  x  is  classified 
as  abnormal  if  g(x)  >  yc  where  yc  is  the  decision  variable  threshold.  One  should  note 
that  defining  the  abnormal  assignment  as  greater  than  or  equal  to  the  threshold  yc  is  an 
arbitrary  convention.  We  could,  equivalently,  have  defined  the  abnormal  class  as  corre¬ 
sponding  to  decision  variable  outcomes  that  are  strictly  less  than,  strictly  greater  than, 
or  less  than  or  equal  to  yc  but  we  will,  without  loss  of  generality,  use  the  definition  given 
above  throughout  this  paper. 

The  conditional  density  functions  of  data  from  the  abnormal  and  normal  classes  are  given 
by  Pz{x |7r0)  and  ps{x \rrn)  respectively.  Similarly,  the  conditional  density  functions  of  the 
decision  variable  are  given  by  py{y\'^a)  f°r  the  abnormal  class  and  py{y\^n)  f°r  the  normal 
class.  We  use  the  symbol  p  to  denote  both  continuous  and  discrete  density  functions, 
with  its  subscripts  denoting  the  random  variables  drawn  from  that  density  function.  The 
true-positive  fraction,  or  the  expected  fraction  of  abnormal  cases  classified  correctly,  is 
given  by 

TPF{yc )  -  J Py{yK)  dy  =  /  •  •  •  /  PstfK)  dDx,  (1) 

Vc  {x:g(x)>yc} 

whereas  the  false-positive  fraction  (the  expected  fraction  of  normal  cases  incorrectly  clas- 
sified)  is 

oo 

FPF{yc )  =  j pv(yK)  dV  =  J  -  J  PM^n)  dDx.  (2) 

Vc  { x:g(x)>yc } 

By  varying  yc  over  its  entire  range  and  plotting  the  ( FPF(yc),T PF{yc ))  pairs,  one  obtains 
a  receiver  operating  characteristic  (ROC)  curve  [4,7-9],  which  describes  the  performance 
tradeoffs  achievable  by  the  classifier. 
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It  has  been  shown  that  the  optimal  ROC  curve  is  obtained  when  the  mapping  function 
employed  is  the  likelihood  ratio 


y  =  LR(x)  = 


Px(^l^"a) 

Px(*K) 


(3) 


or  any  monotonic  transformation  of  the  likelihood  ratio  [20] .  Any  classifier  that  does  not 
use  the  likelihood  ratio  or  a  monotonic  transformation  thereof  as  its  mapping  function  is 


suboptimal. 

Another  commonly  employed  discriminant  function  is  the  posterior  probability  of  an 
abnormal  observation,  pf(7rQ|x),  called  the  Bayes  optimal  discriminant  function  [5].  Here 
t  is  the  discrete  class  membership  random  variable  which  takes  on  the  values  na  and  7rn. 
Bayes’  rule  immediately  gives 

.  .... _ Px(?\<)Pt{ 7Ta)  (4) 

Pd’l-aF)  pg(x\ira)pt(*a)  +  Vx{x\'Kn)pt{'Kn)  ' 


This  leads  to 


Pt{  Ka\x) 


kLR(x) 


(5) 


1  +  kLR(x) 

where  k  =  ?,t(S  and  LR(x)  is  given  in  Eqn.  3.  Because  Eqn.  5  is  a  monotonic  trans- 

Pt{rtn)  V  ' 

formation  of  the  likelihood  ratio  for  positive  k,  we  conclude  that  pt{ na\x)  is  an  optimal 
discriminant  function. 


B.  Modeling  and  Approximating  Probability  Functions  with  ANNs 

In  statistical  estimation,  one  often  makes  an  assumption  concerning  the  data  distribu¬ 
tion.  For  example,  the  density  function  of  a  particular  type  of  data  may  be  modeled  as 
normal  with  mean  p  and  standard  deviation  o .  which  we  represent  as  px(x |/z,  o) .  Estima¬ 
tion  theory  can  then  be  employed  along  with  a  sample  of  data  to  determine  appropriate 
values  of  the  parameter  estimates  fi  and  &  even  when  the  true  density  function  of  the 
data  Px{%)  is  not  normal.  (The  circumflex  here  indicates  an  estimated  quantity.)  In  this 
sense  the  estimation  task  is  analogous  to  curve  fitting  or  function  approximation:  the 
estimated  parameters  are  chosen  so  that  a  Gaussian  model  for  the  density  function  best 
approximates  the  underlying  density  function  from  which  the  sample  of  data  was  drawn. 

There  is  potential  for  confusion  in  using  the  vertical  bar  to  denote  nonrandom  parame¬ 
ters  of  a  function,  because  this  notation  is  identical  to  that  for  a  conditional  probability; 
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many  authors  therefore  adopt  a  different  notation  for  this  case  (a  semicolon  instead  of 
a  bar,  for  example).  However,  in  estimation  theory  it  is  often  useful  to  consider  similar 
problems  from  very  different  points  of  view,  regarding  the  ‘parameters  of  a  function  as 
nonrandom  parameters,  or  as  estimates  which  are  functionally  dependent  upon  a  set  of 
random  observational  data,  or  even  as  Bayesian  quantities  derived  from  density  functions 
(in  a  Bayesian  sense)  that  are  not  necessarily  dependent  on  the  observational  data.  Be¬ 
cause  we  wish  to  emphasize  the  similarities  between  these  approaches,  rather  than  the 
subtle  ( albeit  fundamental)  differences  between  them,  we  will  use  the  vertical  bar  notation 
to  denote  (a)  dependence  on  nonrandom  parameters  of  a  function  (such  parameters  will 
be  clearly  stated  as  being  nonrandom) ;  ( b )  conditional  dependence  on  the  particular  value 
indicated  of  a  random  variable  (such  variables  will  be  clearly  stated  as  being  random);  (c) 
conditional  dependence  on  the  particular  value  of  an  estimate  derived  from  random  obser¬ 
vational  data  (distinguished  from  ( b )  by  a  circumflex  indicating  an  estimated  quantity); 
and  (d)  a  conditional  density  considered  as  a  function  of  the  random  variable  (or  estimate) 
upon  which  it  is  conditional  (distinguished  from  ( b )  and  (c) ,  respectively,  by  boldface  type 
indicating  a  random  variable). 

Thus  in  an  expression  such  as  py(y\x,  pX5  w),  x  represents  a  particular  value  of  the 
random  variable  x  upon  which  y  depends  conditionally  (we  assume  that  x  was  previously 
introduced  as  a  random  variable);  (ix  is  a  particular  value  of  the  estimate  £ix.  upon  which 
y  also  depends  conditionally;  and  w  is  a  set  of  other  (nonrandom)  parameters  of  the 
conditional  density  function  of  y. 

In  the  normal  example  described  above,  the  relevant  density  function  was  modeled 
as  a  function  dependent  upon  a  small  number  of  parameters  (namely  p  and  a).  One 
may  also  consider  more  complicated  cases  in  which  the  underlying  model  for  the  density 
functions  from  which  observational  data  are  drawn  does  not  have  such  a  simple  closed- 
form  representation  and  may  depend  upon  a  large  number  of  parameters  w.  The  ANN 
function  is  one  such  example.  An  ANN  is  a  set  of  connected  nodes  based  loosely  on  the 
human  neuron  system  that  maps  a  generally  multidimensional  input  vector  x  to  a  generally 
multidimensional  output  vector  y  using  the  parameters  w  [11-14, 16].  In  this  paper  we  will 
be  dealing  with  ANNs  that  map  x  to  a  scalar  value  y.  The  output  of  this  ANN  function 
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is  given  by 

y  =  s(^(n-l).f(n-l)+4"-1))  (6) 

where  s  is  a  sigmoidal  activation  function  ( e.g .  tanh  or  the  logistic  cumulative  distribution 
function),  and  where  .t'2-1  represents  the  output  of  the  ith  hidden  layer  of  the  ANN,  defined 
recursively  as 

where  =  x,  the  input.  Readers  unfamiliar  with  ANNs  may  consult  references  [11,12]. 

Since  an  ANN  architecture  of  sufficient  complexity  is  known  to  be  able  to  closely  approx¬ 
imate  any  continuous  function  [21],  it  is  reasonable  to  assume  that  nearly  any  underlying 
density  function  for  x  can  be  closely  approximated  by  a  function  in  this  family  of  functions 
for  some  value  of  a  sufficiently  large  set  of  parameters  w.  As  explained  above,  we  write 
such  an  approximation  as  ps{x \w)  to  emphasize  that  this  is  an  approximation  to  some 
underlying  function  Px(x)- 

C.  Maximum  Likelihood  Estimation  of  ANN  Weights 

Since  pt(7r0 |s)  in  Eqn.  5  is  an  optimal  discriminant  function,  we  can  treat  the  task  of 
classification  as  one  of  modeling,  or  approximating,  the  function  pt{ na\x).  As  outlined 
in  the  previous  section,  we  approximate  the  optimal  discriminant  function  by  an  ANN 
function,  writing  it  as  pt{ na\x,  w),  the  range  of  which  is  bound  between  0  and  1.  We  then 
define  pt{irn\x,  w)  =  1  ~p{*a\x,  w).  The  practical  task,  given  an  actual  ANN  with  a  finite 
number  of  weights  and  a  sample  of  training  data,  is  to  choose  weights  w  such  that  the 
difference  between  the  ANN  output  pt(na\x,w)  and  the  true  Bayes  optimal  discriminant 
function  Pt(^a\S)  is  small. 

ANN  training  involves  using  a  training  dataset  {X,  T)  to  determine  a  value  of  w,  where 
X  =  is  the  set  of  training  feature  vectors,  T  =  {ij£Li  is  the  set  of  known  truth 

(t ra  or  7rn)  for  each  feature  vector,  and  N  is  the  total  number  of  samples  in  the  training 
dataset.  Assuming  the  data  to  be  independently  sampled,  the  true  likelihood  of  the  data 
is  given  by 

PX,t{N,  T)  —  JJ  Pt{ti\xi)Px{xi) ■ 

t=l 
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If  pt( na\x)  is  approximated  by  an  ANN  function  pt(ira\x,w),  then  the  likelihood  of  the 
data  assuming  the  ANN  model  and  given  a  particular  value  of  the  parameters  w  is 

N 

PX,t{X,T\w)  —  w)px{%i)-  (9) 

i— 1 

Maximizing  the  likelihood  of  the  data,  px,t{X^T\vj):  with  respect  to  w  is  equivalent  to 

^ML  . 

maximizing  px(T\X ,  w)  with  respect  to  w.  The  maximum  likelihood  estimate  w  is  thus 
obtained  by  maximizing  ^ 

pT(T\X,w)  =  f[pt(U\xuw)  (10) 

z= 1 

with  respect  to  w.  The  logarithm  of  the  above  expression  is  known  as  the  cross-entropy 
error  function  [12]. 

Ruck  et  al  [15]  proved  that  an  ANN  trained  using  the  sum  of  squares  error  function 
approaches  the  Bayes  optimal  discriminant  function  in  the  limit  of  infinite  training  data. 
We  wish  to  show  that  maximizing  the  utility  function  in  Eqn.  10  also  yields  a  discriminant 
function  that  approaches  the  Bayes  optimal  discriminant  function  given  infinite  training 
data.  Taking  the  logarithm  of  Eqn.  10  yields 

N 

ln(pT(T\X,w))  =  ^ln(pt(^|f2-,u;)).  (11) 

i= 1 

Taking  the  limit  as  N  approaches  infinity  and  employing  the  Strong  Law  of  Large  Numbers 
[22],  we  arrive  at 

ln(pr(T|  X,  w))  =  j  ■■■  J  [ln(pt(7ra|f,  w))ps{x,  na)  +  ln(pt(7r„|f,  w))ps{x,  7rn)]  dD  x.  (12) 

Since  a  sufficiently  large  ANN  is  known  to  be  able  to  closely  approximate  any  continuous 

function  [21],  it  is  reasonable  to  assume  that  pt(ira\x,w)  can  take  on  any  functional  form 

of  interest  here.  The  task  of  finding  the  w  that  maximizes  Eqn.  12  is  replaced  with  the 

more  tractable  problem  of  finding  the  function  pt(jTa\x,w)  that  maximizes  a  particular 

functional,  namely  the  integral  on  the  right  side  of  Eqn.  12. 

Using  the  calculus  of  variations  [23],  one  can  show  that  Eqn.  12  is  maximized  when 

N 

pt(ira\x,w)  -  Pt{ 7Ta|f).  It  follows  that  pXJ(X,T\w)  =  JI  Pt{U\xl,w)Pr,{xi)  is  maximized 

1=1 

-  ML  _  ^ML  .  _ 

in  the  limit  of  infinite  data  by  the  w  such  that  pt(7ia\x,w  )  =  pt{TTa\x). 
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This  result  is  of  limited  practical  utility  because  one  does  not  have  infinite  data  from 
which  to  estimate  ijML .  In  practice,  when  given  a  sample  of  data  {X,T}  from  the  popu¬ 
lation.  the  ML  ANN  model  attempts  to  approximate  the  sample  Bayes  optimal  discrimi¬ 
nant  function,  which  is  one  everywhere  an  abnormal  observation  is  located  in  the  training 
dataset,  zero  everywhere  a  normal  observation  is  located  in  the  training  dataset,  and  ar¬ 
bitrary  (or  undefined)  elsewhere.  This  discontinuous  behavior  is  one  of  the  drawbacks  of 
the  maximum  likelihood  method  that  the  Bayesian  methods  attempt  to  address. 


D.  Bayesian  Estimation  of  ANN  Weights 


We  have  argued  that  the  maximum  likelihood  method  approximates  the  sample  Bayes 
optimal  discriminant  function  when  one  has  finite  training  data.  It  is  necessary,  how¬ 
ever.  to  approximate  the  population  Bayes  optimal  discriminant  function,  because  the 
sample  Bayes  optimal  discriminant  function  is  of  little  practical  use.  This  is  convention¬ 
ally  performed  by  regularization  methods  such  as  early  stopping.  [11,24]  and  weight  decay 


[13,25,26].  Recently,  Bayesian  methods  have  been  used  to  regularize  the  training  process 
to  approximate  the  population  Bayes  optimal  discriminant  function  rather  than  the  sample 
discriminant  function  [16,17],  Training  a  Bayesian  ANN  involves  choosing  the  maximum 
a  posteriori  (MAP)  weight  vector  w  which  maximizes 


p$(w\X,T,  a)  = 


pxjjX,  T\w)pis{w\a) 
JpxAx >  T\w)pAw\a)dw 


(13) 


where  Px,t{X,  T\w)  is  given  in  Eqn.  9,  and  the  “prior”  Pa(tif|3)  is  the  regularization  term 
that  employs  the  parameters  a  to  incorporate  our  prior  belief  of  what  constitute  reasonable 
values  of  w.  It  should  be  noted  that  Px,t{N.  T\w )  in  Eqn.  13  is  now  a  true  conditional 
probability  in  the  Bayesian  sense  [27],  whereas  w  was  a  nonrandom  parameter  of  the 
underlying  model  in  Eqn.  9.  The  term  Px,t{x does  not  depend  on  a ,  because 
we  specifically  model  this  function  as  depending  only  on  the  weight  vector  w.  The  prior 


is  often  modeled  as 

/  I  Vv  \ 

PtfHc?)  =  Cexp  a{wf  J  (14) 

where  oti  is  the  regularization  constant  for  the  ith  weight  wz.  C  is  a  constant  ensuring  that 
p.$(w 1 5)  is  a  properly  normalized  density  function,  and  W  is  the  number  of  elements  in 
the  weight  vector  w  [12,16,17].  Equation  14  attempts  to  limit  the  magnitude  of  the  weight 
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vector  w,  because  it  has  been  shown  that  large  weight  values  correspond  to  complicated 
mapping  functions  [16,28]. 

The  denominator  in  Eqn.  13,  referred  to  hereafter  as  the  evidence,  is  often  written  as 


PxAx>T  I5)  =  /  Px,T{X,T\w)pvi{w\a)dw. 


(15) 


We  are  faced  with  the  same  problem  that  motivated  Eqn.  13,  that  is,  choosing  appropriate 
values  of  a.  We  again  adopt  a  Bayesian  view  of  the  relevant  parameters  and  use  Bayes’ 

rule  to  arrive  at 


/  —t  -tr  /ti\  Px,r(^7la)ps(a)  rifil 

vMX,t)  =  pxt(X:T) 

If  we  assume  that  ps(a)  is  a  flat  prior,  then  maximizing  ps(a\X,T)  with  respect  to  a  is 


equivalent  to  maximizing  px,r(A,  T\a),  or  the  evidence,  with  respect  to  a.  In  principle,  one 
could  use  the  data  to  first  determine  the  paramters  a  using  Eqn.  15,  and  then  determine  the 
parameters  of  the  neural  network  w  using  Eqn.  13.  In  practice,  however,  it  is  prohibitively 
time-consuming  to  evaluate  the  integral  in  Eqn.  15,  so  approximations  of  the  integral  are 


often  employed  [16].  For  a  more  detailed  discussion  of  Bayesian  ANNs  and  the  methods 
used  to  determine  the  regularization  constants  a,  see  references  [16]  and  [17]. 

As  previously  stated,  Bayesian  methods  are  useful  primarily  when  one  has  finite  training 
data.  It  is  still  important,  however,  for  the  Bayesian  method  to  arrive  at  the  Bayes  optimal 
discriminant  function  in  the  limit  of  infinite  training  data  in  order  for  the  estimator  to 
be  consistent.  We  wish  to  show  that  maximizing  the  right  side  of  Eqn.  13  given  infinite 
data  again  leads  to  the  Bayes  optimal  discriminant  function.  Note  that  the  denominator 
of  Eqn.  13  does  not  depend  on  w,  whereas  the  numerator  is  the  same  as  Eqn.  9  but  with 
an  extra  factor  p^{w).  Taking  the  logarithm  of  Eqn.  13,  we  arrive  at 


N 

In  (pis(w\X,T,a))  =  ^\n{pt(U\xl,w)pJw\a)1/N)+ 
i— 1 

£>(p*(fi))  -  In  (/ PxAX,T\™)pM3)dW™)  ■  (17) 

2=1 

Note  that  the  maximization  of  Eqn.  17  with  respect  to  w  affects  only  the  first  term  on  the 
right  side  and  that,  with  respect  to  x{  and  U,  this  first  term  is  similar  in  form  to  the  right 
side  of  Eqn.  11.  The  remaining  arguments  from  the  previous  section  are  unchanged,  and 
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it  can  readily  be  shown  that  the  same  results  are  obtained,  namely  that  p^(w\X,  T,  a)  is 

%MAP  /  |  _*  -MAP.  /  I  -X 

maximized  in  the  limit  of  infinite  data  by  the  w  such  that  pt\jra\x,w  )  —  Pt{Ka\%)- 
As  in  the  maximum  likelihood  situation,  this  result  is  of  little  practical  importance 
because  one  never  has  infinite  data.  We  also  know  that  the  maximum  likelihood  method 
performs  poorly  with  “small”  training  datasets  because  the  maximum  likelihood  method 
attempts  to  approximate  the  sample  Bayes  optimal  discriminant  function  instead  of  the 
population  Bayes  optimal  discriminant  function.  It  is  not  known,  however,  how  well 
the  Bayesian  method  performs  with  limited  training  data.  In  this  work  we  determine 
empirically  the  ability  of  Bayesian  ANNs  to  approximate  the  Bayes  optimal  discriminant 
function  when  trained  using  finite  datasets. 


III.  One- Dimensional  Study 


Before  proceeding  to  the  methods  and  results  of  the  experiments,  we  will  present  a  one¬ 
dimensional  example  of  the  behavior  of  a  Bayesian  ANN:  This  will  allow  us  to  visually 
present  and  explain  the  Bayesian  ANN  at  a  greater  level  of  detail  than  possible  in  the 
higher  dimensional  experiments.  Our  goal  in  this  section  is  to  provide  a  framework  for 
the  interpretation  of  the  methodology  and  results  which  will  be  presented  in  the  next  two 
sections. 

Given  the  ID  density  functions  defined  by 

P,(zK)  =  (s)  '  exP  (-|)  (18> 


Px(x  |7T0)  =  ( — 


.2  \  2/2 


( bx  —  a)2 
2 


with  a  and  b  >  0,  we  can  compute  the  theoretical  optimal  mapping  function  pt{ 7rQ|x)  using 
Eqn.  5,  which  results  in 


9  =  =  1  +  »expH((fa-°F -*’))' 

Using  the  theory  of  random  variables  [29],  we  can  determine  the  conditional  density  func¬ 
tions  of  the  new  random  variable  y  =  pt(Tra\x)  using 

v  (vlt)=  i*)  \ ...  i  p^(oifl  +  ...  (2i) 

PV ^  bf(7Ta|Z(l))|  \Pt(Ka\X(l))\ 


kb  exp  (-|((i>:r  —  a)2  —  x2)) 
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where  t  is  either  na  or  irn,  x{l)  denotes  the  Ith  root  of  Eqn.  20  for  a  specific  y,  and  p[Mx{l)) 
is  the  derivative  of  pt(ira|x)  with  respect  to  x  evaluated  at  x{l)  [29].  For  the  special  case 
b  =  1  and  a  >  0,  Eqn.  20  has  only  one  root  given  by 


=  (22 
Combining  Eqns.  20,  21,  and  22,  we  obtain  the  b  =  1  conditional  density  functions  of  y, 


PvivWn)  = 


ay{l-y)V2n 


exp  [ (7  log 


2[a^\k(l-y)J  '  2 


Py(y\*a)  ~  ^  _  y)^  ^P  ^  2  (fc(1  _  y)J  2j 

for  0  <  y  <  1. 

The  b  7^1  case  is  more  complicated  because  Eqn.  20  now  has  two  roots 


X(1)_62-l  \(P-l)2  b2  -  1  l0g  \kb{l  -  y) 


x(2)  -  b2  _  i  +  aJ  (&2  _  1)2  b2  -  1  6  \kb{  1  -y))' 

Using  Eqns.  25,  26  and  20  with  21  we  obtain  the  6  ^  1  conditional  density  functions  of  y 


PyivWn)  = 


PxQE(  l)Kn)  +Px(g(2)kn) 

3/(1  -  y)>2  -  W  -  l)log(fefc^y) 


Py(yka)  = 


Px(S(l)K)  +  Px(S(2)K) 

2/(1  -  2/)x/«2  -  2(b2  —  1>1°g(fcb(^)) 


The  values  of  y  are  bound  between  0  and  1  due  to  the  form  of  Eqn.  3.  Because  the  square 
root  terms  in  Eqns.  25  and  26  must  be  non-negative,  the  value  of  b  further  restricts  the 
range  of  possible  y  (or  p{ira\x))  values  when  b^  1,  i.e.,1 


b>  1  0  <  y  <  ^1  - 

’The  symbol  is  known  as  the  “leads  to”  symbol. 


kb  exp  (2(#3I))  +  1/  ’ 
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b  <  1  ^1  - 

and 


13 

7  1  2  < — 1  <  y  <  l, 

(30) 

(2(62-1))  + 1) 

0  <  y  <  1. 

(31) 

Given  a  sample  of  ID  data  taken  from  the  densities  shown  in  Eqiis.  37  and  38,  the 
Bayesian  ANN  should  model  the  mapping  function  given  in  Eqn.  20.  To  show  this,  we 
sampled  500  normal  observations  and  500  abnormal  observations  (N  =  1000)  from  the 
density  functions  shown  in  Fig.  1  (a  =  1  and  b  =  0.5).  We  trained  a  Bayesian  ANN 
with  10  hidden  units  on  this  data  to  produce  pt{Tra\x,w  ),  which  is  shown  in  Fig. 
2  along  with  the  theoretical  true  mapping  function  pt(7ra|x)  (Eqn.  20).  (The  training 
method  will  be  described  in  the  next  section.)  Because  the  Bayesian  ANN  mapping 
function  approximates  the  true  mapping  function  pt( 7ra|x),  we  conclude  that  the  output 
y ^  —  Pt{^a\xii  rc  )j  i  —  1, . . . ,  N  of  the  Bayesian  ANN  using  the  training  data  as  input 
should  behave  as  if  sampled  from  the  density  functions  given  in  Eqns.  25-28.  Figure  3  shows 
both  the  theoretical  density  functions  py{y\nn)  and  py{y\^a)  for  the  a  =  1,  b  =  0.5  case, 
together  with  the  histograms  of  the  abnormal  and  normal  training  data  output  from  the 
Bayesian  ANN.  The  histograms  of  the  Bayesian  ANN  decision  variable  closely  match  the 
densities  of  the  ideal  observer  decision  variable.  Finally,  one  can  generate  the  theoretical 
ROC  curve  using  Eqns.  1  and  2  and  the  theoretical  densities  of  the  ideal  decision  variable 
Py(y\Tia)  and  py{y\nn).  This  can  then  be  compared  (see  Fig.  4)  with  the  ROC  curve 
generated  using  the  Bayesian  ANN  output  of  an  independently  sampled  testing  dataset 
using  the  “proper”  curve  fitting  method  [6,30].  Again,  the  true  optimal  ROC  curve  and 
the  ROC  curve  produced  using  the  output  of  the  Bayesian  ANN  are  similar. 


IV.  Methods 


A.  Bayesian  ANN  Implementation 

In  this  work,  we  have  employed  a  neural  network  with  an  input  layer,  a  single  hidden 
layer,  and  a  single  output  node.  Our  implementation  of  the  Bayesian  ANN  is  based  on 
the  work  of  MacKay  [16].  An  approximation  is  made  in  MacKay’s  methods  that  should 
be  noted.  In  order  to  determine  the  regularization  parameters  a,  one  must  evaluate  the 
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integral  in  Eqn.  15.  In  MacKay’s  method,  the  integrand  px,T(X,T\w)pus(w\a)  is  locally 
approximated  using  a  Taylor  expansion  around  the  most  probable  weight  values  w  as 
being  Gaussian.  The  integral  in  Eqn.  15  can  then  be  evaluated  explicitly.  This  changes  the 
method  of  optimization,  however,  because  the  evidence  now  depends  on  this  w  ,  which 
in  turn  depends  upon  the  choice  of  oe.  Hence,  a  dual  optimization  is  performed  in  which 
one  fixes  the  parameters  a  to  find  wMP,  then  uses  wMP  to  find  a  new  set  of  parameters 
a.  and  then  repeats  this  process  a  fixed  number  of  times.  At  this  point,  the  parameters 
<3  are  assumed  to  be  properly  determined  and  the  w  can  be  labeled  the  maximum  a 
posteriori  value  of  the  weights  wMAP.  It  has  been  determined  empirically  that  eight  such 
iterations  work  well  for  most  applications.  In  other  methods,  such  as  those  discussed  by 
Neal  [17].  the  integral  in  Eqn.  15  is  evaluated  using  Monte  Carlo  methods.  This  paper 
will  not  evaluate  the  validity  of  the  assumption  made  in  using  MacKay’s  methods,  instead 
focusing  on  the  accuracy  of  Bayesian  ANNs  under  these  assumptions. 

The  optimizations  were  performed  using  variable  metric  (or  quasi-Newton)  techniques 
[31]  with  a  tolerance  of  1.0  x  10“7  and  a  maximum  of  1000  iterations.  Numerical  methods 
were  employed  to  determine  the  covariance  matrix  of  the  Gaussian  approximation  used  to 
evaluate  the  evidence  [12].  In  practice,  the  constants  a  are  constrained  so  that  only  three 
distinct  values  are  used:  one  for  the  hidden  layer  weights,  one  for  hidden  layer  biases, 
and  one  for  output  layer  weights  and  biases.  Reference  [16]  provides  details  concerning 
Bayesian  ANNs  and  the  implementation  we  used. 


B.  ROC  Analysis  with  Bayesian  ANNs 
Given  a  Bayesian  ANN  mapping  function 


,  ~MAP^ 

y  =  Pt{  TTa\X,W  ) 


(32) 


and  the  density  functions  py{y \irn)  and  py(y\na),  one  could  produce  the  ROC  curve  for  this 
classifier  using  Eqns.  1  and  2.  It  is  interesting  to  note  that  a  Bayesian  ANN  approximates 
an  optimal  mapping  function  without  explicitly  modeling  any  of  the  density  functions 
ps{x\nn),  ps{x\ra),  py{y\TTn)  or  py(y\ira).  We  can  therefore  use  a  Bayesian  ANN  to  model 
the  optimal  mapping  function  from  feature  space  x  to  decision  space  y.  but  we  cannot 
employ  a  Bayesian  ANN  to  produce  a  continuous  estimate  of  the  optimal  ROC  curve. 
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Metz  and  Pan  [6,  30]  developed  a  parametric  ROC  curve  fitting  model  that  assumes  an 
underlying  binormal  model  and  uses  likelihood  ratio.  The  ROC  curves  produced  here  were 
generated  using  the  maximum  likelihood  estimates  of  the  distribution  parameters  under 
this  “proper”  binormal  model  given  testing  dataset  outputs  of  the  Bayesian  ANN. 


C.  Methods  of  Evaluation 

ROC  analysis  cannot  directly  assess  the  performance  of  Bayesian  ANNs  in  the  task  of  ap¬ 
proximating  Pi(7ra|x),  because  monotonic  transformations  of  pt(ira\x)  will  produce  identical 

MAP 

ROC  curves.  Figure  5  shows  two  different  mapping  functions  pt( na\x)  and  pt{na\x,  w  ) 
and  their  corresponding  ROC  curves.  Because  these  two  mapping  functions  are  monotonic 
transformations  of  one  another,  the  ROC  curves  produced  by  these  mapping  functions  are 
identical.  In  this  work,  however,  we  take  the  stance  that  the  task  of  a  Bayesian  ANN  is 
not  only  to  generate  the  ideal  observer  ROC  curve,  but  also  to  approximate  accurately  the 
particular  ideal  observer  decision  variable  pt(rra\x).  Although  the  ideal  observer  ROC  will 
be  obtained  if  a  Bayesian  ANN  produces  pt(na\x)  exactly,  the  primary  task  in  training 
Bayesian  ANNs  is  not  to  optimize  with  respect  to  the  ROC  curve  but  with  respect  to  the 
underlying  decision  variable.  We  therefore  believe  that  an  analysis  of  the  Bayesian  ANN  s 
ability  to  approximate  the  specific  decision  variable  pt(na\x)  is  of  fundamental  importance. 
Other  methods  of  training  that  directly  acknowledge  the  multiobjective  nature  of  classifier 
training  can  be  employed  to  directly  optimize  with  respect  to  the  ROC  curve  and  not  with 
respect  to  an  underlying  decision  variable  [20,32]. 

,  _  t.MAP 

As  we  have  discussed,  a  Bayesian  ANN  produces  an  approximation  pt{Tra\x,w  )  ot 
pt{Tra\x).  If  the  densities  ps(x\rca)  and  ps(x \rrn)  are  known,  we  can  compute  the  theoretical 
pt{ 7ra|x)  using  Eqn.  5.  The  mean-squared-error  (MSE)  between  the  true  optimal  mapping 
function  pt(7Ta|5)  and  the  Bayesian  ANN  model  of  this  function  using  one  particular  sample 
of  training  data  S  =  {X,T}  is  defined  as 


e(S)  =  Es{(pt{i Ta\x,wM  )  -pt( 7ra|x))2}  = 

OO  OO  -  2 

J  •••  J  Px{x)  pt{-Ka\x,wMAP)  -Pt(7ra|x)j  dPx. 


(33) 


~_MAP 

Here  w 


is  not  random  because  it  is  the  specific  set  of  weights  determined  from  the 
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data  S.  This  expectation  value  becomes  difficult  to  compute  as  the  dimensionality  of  the 
input  data  increases,  so  we  estimate  e(S)  from  a  testing  dataset  S'  =  { X  ,  T  }  with  N 


observations: 


i  jiL.  r  .  .  *  map  , 
e(S,  S')  =  —  w  )  -  pt{^a\Xi) 


where  x[  is  the  ith  observation  in  the  testing  dataset  S'.  Equations  33  and  34  measure 
the  difference  between  the  true  optimal  mapping  function  Pt  Aa\x)  &nd  one  particular 
Bayesian  ANN  model  of  this  function  weighted  by  the  density  of  the  data  ps{x).  We 
wish  to  measure  not  only  the  difference  between  a  single  Bayesian  ANN  model  and  the 
true  optimal  mapping  function,  but  also  the  robustness  of  the  methodology.  Therefore  we 
average  multiple  observations  of  e(S .  S')  where  S  is  now  random,  i.e., 

1  M 

*-£  £*<*.*;>•  (35> 
Here  St  and  S'  are  distinct  and  independently  sampled  pairs  of  training  and  testing 
datasets.  It  is  important  to  note  that  Eqn.  35  measures  the  average  MSE  over  M  different 
Bayesian  ANN  models  with  M  different  estimates  of  w  .  Similarly,  an  estimate  of  the 
variance  of  fi,  the  standard  error  squared,  is  given  by 

-I  M 

(36) 

For  all  the  studies  we  performed,  the  number  (M)  of  different  datasets  used  to  estimate 
the  sample  mean  and  standard  error  was  fixed  at  100. 

D.  Simulations 

In  all  simulation  studies  reported  in  this  paper,  we  sampled  data  using  an  isotropic 
Gaussian  _ 


PsiA^n) 


l\r/2-  (  As? 

s)  expl-gT 


with  zero  mean  and  unit  marginal  standard  deviations  for  the  normal  class  and  an  isotropic 


Gaussian 


,-1  ,  (,r\r"'2  (  V- lhx- - a>1 

Pi(l|>r.)  =  (  2ir )  exp  -g— ^ 


with  marginal  means  a/b  and  marginal  standard  deviations  of  1/6  for  each  feature  for  the 
abnormal  class.  Here  D  is  the  dimension  of  x  or,  equivalently,  the  number  of  features  used. 
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As  proposed  and  derived  in  reference  [18],  the  signal-to-noise  ratio  (SNR)  for  this  data 
is  defined  such  that  the  “signal”  is  the  difference  between  the  means  of  the  distributions 
and  the  “noise”  is  the  root-mean-square  standard  deviation  of  the  two  distributions.  This 

results  in  - - 

SNR(°>  =  a  =  VD  SNR«  (39) 

V  IT  +  1 

where  SNR(1)  is  the  signal-to-noise  ratio  of  one-dimensional  data  with  parameters  a  and 
b. 

We  studied  the  accuracy  of  Bayesian  ANN  models  as  a  function  of  the  number  of  training 
samples  N,  the  number  of  hidden  units  employed  H .  the  SNR  of  the  data,  and  the 
dimension  of  the  data  D.  The  relationship  between  the  number  of  hidden  units  H  and 
the  total  number  of  weights  W  for  a  single-output  classification  neural  network  is  given 
by  W  =  {D  +  2 )H  +  1. 

V.  Results 

Figure  6  shows  the  performance  of  the  Bayesian  ANN  with  varying  numbers  of  hidden 
units  H  and  input  dimensions  D  for  a  fixed  signal-to-noise  ratio  SNR  =  1.26  and  sample 
size  N  =  200.  For  the  lower  dimensional  curves  (D  =  1,  D  =  2,  and  D  =  3),  the  average 
MSE  between  the  optimal  decision  variable  and  the  Bayesian  ANN  approximation  of  that 
decision  variable  decreases  as  the  number  of  hidden  units  increases  and  then  becomes 
relatively  constant  after  a  certain  threshold.  For  D  =  2,  the  difference  becomes  relatively 
constant  after  3  hidden  units,  whereas  the  D  —  3  curve  flattens  out  after  4  hidden  units. 
More  parameters  are  required  to  better  approximate  the  optimal  mapping  function  as  the 
number  of  dimensions  increases.  The  D  —  4  and  D  =  5  curves  show  a  definite  minimum 
in  their  MSE.  The  “curse  of  dimensionality”  is  a  common  problem  caused  by  the  relative 
density  of  data  decreasing  substantially  as  the  dimensionality  of  the  data  increases.  This 
effect  is  seen  in  the  D  =  4  and  D  =  5  curves.  The  Bayesian  ANN  does  not  have  enough 
data  to  properly  approximate  the  optimal  mapping  function,  so  a  tradeoff  exists  between 
simpler  solutions  with  few  parameters  that  cannot  match  the  ideal  mapping  function  (due 
to  under-parameterization),  and  more  complicated  solutions  with  many  hidden  units  that 
cannot  be  properly  determined  due  to  the  lack  of  data.  This  effect  causes  the  minimum 
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in  the  D  —  A  and  D  =  5  curves. 

Figure  7  shows  a  similar  plot  with  varying  numbers  of  hidden  units  H  and  input  di- 
mension  D  with  a  fixed  SNR  =  1.26  but  with  more  training  samples,  N  =  1000.  The 
Bayesian  ANN  no  longer  has  any  difficulty  with  the  D  =  4  and  D  =  5  cases;  they  both 
flatten  out  after  a  certain  number  of  hidden  units.  In  this  plot,  we  can  also  more  clearly 
see  the  effect  of  the  dimensionality  of  the  data  on  the  number  of  hidden  units  required. 
The  higher  the  input  dimension,  the  more  weights  are  required  to  best  approximate  the 
optimal  decision  variable. 

The  complexity  of  the  optimal  mapping  function  is,  for  our  simulation  studies,  a  function 
of  the  SNR  of  the  data.  In  general,  SNR  by  itself  is  not  sufficient  to  characterize  the 
relationship  between  data  density  and  the  complexity  of  the  optimal  mapping  function. 
This  relationship  for  one-dimensional  data  was  explored  by  Metz  and  Pan  [6]  but  is  beyond 
the  scope  of  this  work.  For  our  purposes  it  is  sufficient  to  note  that,  for  a  fixed  b  ± 
1,  the  optimal  mapping  function  (Eqn.  20)  becomes  more  complicated  where  ps{x)  is 
nonnegligible  as  SNR  decreases,  and  becomes  more  sigmoidal  where  ps(x)  is  nonnegligible 
as  SNR  increases  (see  Fig.  8).  We  therefore  conclude  that  the  number  of  weights  needed 
to  best  model  the  optimal  mapping  function  should  change  with  SNR  for  a  fixed  b.  This 
is  shown  in  Fig.  9  where  the  SNR  is  fixed  at  the  larger  value  of  3.80.  Fewer  weights 
(free  parameters)  are  needed  to  model  the  optimal  mapping  function.  In  fact,  increasing 
the  number  of  hidden  units  H  results  in  a  gradual  increase  in  the  average  MSE  for  the 
D  —  5  curve.  This  increase  is,  however,  inconsequential  when  compared  to  the  average 
MSE  values  caused  by  having  too  few  hidden  units  in  Fig.  7. 

In  order  to  further  understand  the  effect  of  SNR  on  the  accuracy  of  Bayesian  ANNs, 
we  performed  simulation  studies  in  which  the  SNR  was  varied  between  0  and  5.  Figure 
10  shows  the  average  MSE  of  Bayesian  ANNs  as  a  function  of  SNR  for  a  fixed  H  =  4 
and  N  =  200.  The  average  MSE  remains  relatively  constant  and  then  decreases  as  SNR 
increases  for  the  D  =  1,  D  =  2  and  D  =  3  curves.  The  remainder  of  the  curves  in  Fig.  10 
show  a  decrease  in  the  average  MSE  as  the  SNR  increases.  Figure  11  shows  the  effect  of 
SNR  on  the  accuracy  of  Bayesian  ANNs  with  H  =  10  and  N  =  1000.  All  of  the  curves  in 
Fig.  11  remain  relatively  constant  until  the  average  MSE  begins  to  decrease  as  the  SNR 
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increases.  These  two  figures  show  that  the  Bayesian  ANN  can  easily  model  an  optimal 
mapping  function  that  is  sigmoidal  for  values  of  x  where  ps{ x)  is  nonnegligible  (high  SNR 
and  b  =  0.5),  but  has  more  difficulty  modeling  a  mapping  function  that  is  more  complicated 
where  ps(x)  is  nonnegligible  (low  SNR  and  b  =  0.5).  Another  method  of  describing  this 
result  is  to  note  that  for  a  high  SNR  and  b  =  0.5,  the  quadratic  separation  function 
through  the  data  can  be  approximated  by  a  line  where  the  data  are  dense,  whereas  for 
low  SNR  and  b  =  0.5,  the  separation  function  has  much  more  local  curvature  where  the 

data  are  dense. 

We  have  shown  the  effect  of  the  number  of  hidden  units  H ,  the  SNR  of  the  data,  and 
the  dimension  of  the  data  D  on  the  accuracy  of  Bayesian  ANNs.  We  have  also  indirectly 
shown  the  effect  of  sample  size  on  the  accuracy  ability  of  Bayesian  ANNs  because  Figs. 
6  and  7  have  different  sample  sizes  N.  Figure  12  more  clearly  shows  the  effect  of  sample 
size  on  the  average  MSE  between  the  true  optimal  mapping  function  and  the  Bayesian 
ANN  model  of  that  function  with  various  numbers  of  hidden  units.  As  the  number  of 
training  samples  increases,  the  average  MSE  decreases.  When  there  are  too  few  hidden 
units,  the  average  MSE  asymptotically  approaches  a  high  value  but  flattens  out  after 
just  100  samples.  When  enough  enough  hidden  units  are  used  to  properly  estimate  the 
optimal  mapping  function  (see  Fig.  7),  the  average  MSE  asymptotically  approaches  a  much 
lower  value,  but  requires  more  samples  to  do  so.  The  effect  of  adding  additional  hidden 
units  beyond  the  minimum  required  to  properly  model  the  optimal  mapping  function  has 

negligible  effect. 

VI.  Discussion 

While  the  average  MSE  between  the  optimal  mapping  function  and  the  Bayesian  ap¬ 
proximations  of  that  function  is  fundamentally  important,  it  does  not  address  all  the 
issues  relevant  to  training.  If  the  average  MSE  is  high,  one  does  not  know  whether  the 
Bayesian  ANN  mapping  function  is  too  complicated,  too  simple,  or,  possibly,  a  monotonic 
transformation  of  the  ideal  observer  mapping  function.  Typically,  when  too  few  hidden 
units  are  employed  in  the  Bayesian  ANN,  the  resulting  mapping  function  is  overly  simple, 
resulting  in  a  large  average  MSE.  When  too  many  hidden  units  are  used  without  enough 
training  data  (as  for  the  D  =  5  curve  in  Fig.  6),  we  have  found  that  the  Bayesian  ANN 
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mapping  function  can  become  too  complicated,  i.e.,  the  Bayesian  ANN  overtrains.  Figure 
13  shows  the  training  and  testing  dataset  ROC  curves  for  two  different  Bayesian  ANNs. 
Figure  13(a)  was  generated  with  D  =  5,  N  =  160,  H  =  10,  and  SNR  =  1.26.  The 
training  dataset  ROC  curve  is  clearly  well  above  the  ideal  observer  ROC  curve,  while  the 
ROC  curve  generated  using  an  independent  testing  dataset  is  well  below  the  ideal  observer 
ROC  curve.  The  Bayesian  ANN  overtrained  in  this  example.  When  more  samples  were 
used  (N  =  1000),  the  training  and  testing  dataset  ROC  curves  were  similar  to  the  ideal 
observer  ROC  curve  (Fig.  13(b)). 

As  we  have  previously  argued,  it  is  important  that  a  Bayesian  ANN  not  only  approximate 
an  ideal  observer  decision  variable,  but  that  it  approximate  the  particular  ideal  observer 
decision  variable  pt{ 7ra|:ic).  Because  the  Bayesian  ANN  models  a  probability  function,  it 
has  the  added  benefit  of  allowing  a  specific  interpretation  of  the  ANN  output.  From 
decision  theory  [11],  we  know  that  one  should  call  an  observation  x  abnormal  if 

[Ufa  K)  -  Ufafa)]ptfa\x)  >  [Ufafa)  -  t/(7r„|7rn)](l  -Ptfa  I*))  (40) 

where  U{i\j)  is  the  utility  of  classifying  an  observation  as  i  when  it  is  actually  from  class 
j.  Therefore,  if  one  can  quantify  the  utilities  associated  with  classifying  observations,  then 
the  Bayesian  ANN  output  should  be  employed  to  optimally  determine  a  threshold  at  which 
to  call  an  observation  abnormal.  If  the  output  of  an  ANN  does  not  have  this  interpretation 
then  one  could  use  the  slope  of  the  estimated  ROC  curve  (which  is  the  likelihood  ratio  over 
the  decision  variable)  to  get  an  estimate  of  Pt{^a\y)  which  could,  in  turn,  be  used  instead 
of  ptfa\x)  in  Eqn.  40.  This  requires  an  extra  estimation  step,  namely  the  estimation  of 
the  ROC  curve,  and  is,  therefore,  less  direct. 

The  Bayesian  ANN  output  can  also  be  used  to  provide  probabilities  of  malignancy  to 
radiologists  in  observer  studies  or  computerized- diagnosis  schemes.  However,  this  is  diffi¬ 
cult  with  diagnostic  classifiers  because  the  prior  probabilities  are  typically  very  different. 
The  traditional  method  of  sampling  is  to  blindly  employ  observations,  along  with  their 
corresponding  classes,  in  training  a  Bayesian  ANN.  If  the  prior  probabilities  of  the  two 
classes  are  very  different,  then  one  class  will  not  be  as  well  represented  as  the  other  class. 
For  example,  if  one  randomly  chooses  1000  mammograms  to  be  used  in  ANN  training, 
then  there  will  be,  on  average,  only  about  5  cases  with  malignant  lesions.  Typically,  for 
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diagnostic  classifiers,  one  samples  each  class  separately  so  as  to  get  equal  numbers  of 
observations  in  each  class.  This  would  cause  a  Bayesian  ANN  trained  on  such  data  to  im¬ 
properly  approximate  pt{ ira\S),  because  the  training  data  are  not  properly  sampled  from 
the  screening  population.  The  Bayesian  ANN  would,  in  fact,  approximate  the  function 

k^LRjx)  .  (41) 

1  +  k'LR{x)  ’ 

where  k'  is  the  ratio  of  malignant  to  non-malignant  training  data.  We  can,  however, 
transform  the  Bayesian  ANN  output  to  have  a  more  appropriate  k  value.  For  example, 
if  one  samples  500  malignant  observations  from  a  population  and  then  samples  500  non- 
malignant  observations  from  another  population,  the  k'  value  in  Eqn.  41  is  1.  If  the  ratio 
of  the  prior  probabilities  is  known  to  be  k  =  0.01,  then  one  can  transform  the  Bayesian 
ANN  output  to  approximate  pt{ 7ra|x)  by  solving  Eqn.  41  for  LR{x )  and  substituting  this 

function  into  Eqn.  5  using  the  appropriate  value  of  k. 

We  have  chosen  to  analyze  Bayesian  ANNs  that  employ  a  Gaussian  prior  on  the  weight 
values  with  parameters  a  because  such  priors  have  been  used  extensively  in  the  neural  net¬ 
work  community.  However,  other,  possibly  more  appropriate,  priors  have  been  developed 
and  studied.  For  example,  Williams  [33]  has  studied  the  use  of  a  Laplace  prior  for  ANNs, 
whereas  entropy-based  and  other  types  of  priors  are  discussed  by  Buntme  and  Weigend 

[34]- 

VII.  Conclusions 

We  have  shown  that  the  goal  of  training  a  Bayesian  ANN  is  to  approximate  a  particular 
ideal  observer  decision  variable.  The  ROC  curves  produced  using  two  different  decision 
variables  will  be  the  same  if  the  two  decision  variables  are  monotonically  related.  A  poor 
estimate  of  pt{ na\x)  can  yield  an  ROC  curve  equal  to  that  obtained  from  a  good  estimate 
of  pt(ira\x)  if  the  two  solutions  are  monotonically  related.  Thus,  a  measure  of  a  Bayesian 
ANN’s  accuracy  in  approximating  the  particular  ideal  observer  decision  variable  pt(TTa\x) 
is  vital.  Because  the  output  of  a  Bayesian  ANN  is  an  estimate  of  a  probability,  it  has 
other  benefits  such  as  aiding  in  the  determination  of  decision  thresholds  and  presenting 
likelihoods  of  malignancy  to  radiologists. 

As  was  expected,  we  have  shown  that  Bayesian  ANNs  better  approximate  the  optimal 
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decision  variable  pt(vra|x)  when  the  training  dataset  has  more  observations  and/or  fewer 
dimensions.  We  have  also  shown  that,  given  enough  training  data,  the  performance  of  a 
Bayesian  ANN  is  not  impaired  by  excess  hidden  units,  as  seen  in  Fig.  7.  Above  a  certain 
number  of  hidden  units,  the  average  MSE  remains  relatively  constant  despite  many  more 
parameters  being  added  to  the  model.  With  less  training  data,  however  (see  the  D  =  5 
curve  in  Fig.  6),  there  is  a  tradeoff  between  simplifying  the  mapping  function  with  fewer 
hidden  units  and  adding  more  parameters  to  allow  the  mapping  function  more  freedom. 

Figure  7  also  shows  that  a  minimum  number  of  hidden  units  is  required  to  “best” 
approximate  the  optimal  mapping  function,  because  having  too  few  hidden  units  does 
not  allow  the  mapping  function  enough  freedom  to  adequately  approximate  the  optimal 
mapping  function.  This  is  due  to  the  tradeoff  between  the  ANN  not  having  sufficient 
parameters  to  effectively  approximate  the  Bayes  optimal  discriminant  function  and  the 
ANN  having  too  many  parameters  with  too  little  training  data  to  effectively  estimate 
a  “good”  w.  Fewer  than  6  hidden  units  for  the  D  =  5  curve  will  result  in  a  poorer 
approximation  of  pt(7ra|x).  However,  this  minimum  number  of  hidden  units  is  dependent 
on  the  density  functions  of  the  underlying  data.  When  we  increased  the  SNR  of  the  data 
and,  hence,  simplified  the  optimal  mapping  function,  the  minimum  number  of  necessary 
hidden  units  decreased  to  1  (Fig.  9).  Given  an  abundance  of  training  data,  it  is  thus  safer 
to  err  on  the  side  of  more  hidden  units  than  fewer,  because  excess  weights  are  clearly  less 
detrimental  than  too  few.  This  is  not  true  when  one  has  limited  training  data. 

There  is  clearly  a  complicated  relationship  between  the  performance  of  a  Bayesian  ANN 
and  the  number  of  weights  H ,  the  sample  size  N ,  the  number  of  input  features  D,  and 
the  signal- to-noise  ratio  of  the  data  SNR.  Care  must  be  taken  with  each  new  neural 
network  model  to  ensure  that  the  optimal  decision  variable  pt(na\x)  is  being  appropriately 
approximated.  In  practice,  one  does  not  know  the  true  optimal  mapping  function,  so  inde¬ 
pendent  test  datasets  or  cross-validation  techniques  must  be  used  to  ensure  the  robustness 
of  Bayesian  ANN  classifiers. 
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Fig.  1.  Example  density  functions  with  a  =  1  and  b  =  0.5.  A  Bayesian  ANN  was  trained  using  500  samples 
from  Px{x\-na)  and  500  samples  from  px(x |7rn)  to  illustrate  the  estimation  task  of  the  Bayesian  ANN. 
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Fig.  2.  Using  the  data  sampled  from  the  density  functions  shown  in  Fig.  1,  the  Bayesian  ANN  produced 
the  mapping  function  pt( 7r0|x,w^P)  (dashed  curve)  shown  with  the  theoretical  mapping  function 
(solid  curve)  produced  using  the  actual  density  functions  and  not  the  data  sampled  from  the  density 
functions.  It  should  be  noted  that  the  largest  discrepancies  are  in  the  x  <  -2  region  of  the  plot  which 
is  an  area  where  there  is  little  data  (see  densities  in  Fig.  1).  The  error  calculations  employed  in  this 
work  are  weighted  by  the  density  of  the  data  px{x),  so  discrepancies  in  sparse  regions  of  the  mapping 
function  are  not  as  important  as  discrepancies  in  dense  regions. 


November  18,  1999 


IEEE  TMI  (CONFIDENTIAL) 


27 


(a) 


Fig.  3.  Given  the  theoretical  mapping  function  (Eqn.  20)  and  the  original  density  functions  (Eqns.  37 
and  38)  one  can  produce  the  density  functions  of  the  decision  variable  y  which  is  shown  (curves)  for 
(a)  the  abnormal  class  and  (b)  the  normal  class.  Plotted  with  each  density  function  is  a  histogram  of 
the  (a)  abnormal  training  data  and  (b)  the  normal  training  data  after  it  has  been  mapped  using  the 
transfer  function  pt(7ra\x,ijMAP)  in  Fig.  2.  The  histograms  were  scaled  to  have  an  area  of  1.  A  k  of 
1  was  used  to  produce  these  plots. 
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Fig.  4.  The  ideal  observer  ROC  curve  was  produced  using  Eqns.  1  and  2  and  the  theoretical  density 
'  functions  over  the  decision  variable.  The  Bayesian  ANN  ROC  curve  was  produced  using  the  “proper” 
binormal  ROC  curve  fitting  method  on  the  independent  testing  dataset  output  yl  of  the  Bayesian 
ANN  with  1000  samples  in  each  class,  i.e.,  N  =  2000. 
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(a)  0>) 

Fig.  5.  Two  different  mapping  functions  (a)  will  produce  the  same  ROC  curves  (b)  if  the  two  map¬ 
ping  functions  are  monotonic  transformations  of  one  another.  The  Bayesian  ANN  that  generated 
pt{7Ta\x.-j}MAP)  is  said  to  be  a  poorly  trained  ANN  because  it  does  not  approximate  pt(na\x)  well 
despite  the  fact  that  it  yields  the  same  ROC  curve. 
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Fig.  6.  The  effect  of  the  number  of  hidden  units  H  on  the  accuracy  of  Bayesian  ANNs  with  a  fixed 
SNR  =  1.26  and  sample  size  N  =  200.  Because  there  is  a  limited  training  dataset,  the  Bayesian 
ANN  cannot  properly  approximate  the  optimal  mapping  function  at  higher  dimensions  (D  =  4  and 
D  -  5).  The  error  bars  represent  ±2  standard  errors  of  each  mean  (5).  The  density  functions  used  in 
this  simulation  study  had  parameters  a  -  1  /y/D  and  b  -  0.5. 
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Fig.  7.  The  effect  of  the  number  of  hidden  units  H  on  the  accuracy  of  Bayesian  ANNs  with  a  fixed 
SNR  —  1.26  and  sample  size  N  =  1000.  With  enough  training  data,  the  Bayesian  ANN  can  properly 
approximate  the  optimal  mapping  function  at  the  higher  dimensions  (D  =  4  and  D  —  5).  The  effect 
of  increasing  the  number  of  hidden  units  after  a  certain,  point  is  no  longer  beneficial  but  also  not  very 
detrimental.  The  error  bars  represent  ±2  standard  errors  of  each  mean  (s).  The  density  functions 
used  in  this  simulation  study  had  parameters  a  =  l/y/D  and  b  =  0.5. 
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(a)  (b) 


(c)  (d) 

Fig.  8.  For  (a)  low  SNR  and  b  =  0.5,  the  (b)  optimal  mapping  function  is  complicated  for  values  of  x 
such  that  Px(x)  is  nonnegligible.  When  (c)  the  SNR  is  higher  for  the  same  b  =  0.5,  the  (d)  optimal 
mapping  function  is  sigmoidal  where  Px(x)  is  nonnegligible.  Equation  20  always  has  two  roots  when 
6  ^  1  so,  in  actuality,  the  mapping  function  shown  in  (d)  does  have  two  values  of  x  for  a  given  y 
(except  for  the  minimum)  but  the  second  x  (for  most  y)  occurs  in  an  uninteresting  region,  i.e.,  an 
observation  that  is  unlikely  to  occur.  Analogous  conclusions  can  be  made  for  D  larger  than  1. 
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Fig.  9.  The  effect  of  the  number  of  hidden  units  H  on  the  accuracy  of  Bayesian  ANNs  with  a  fixed 
SNR  =  3.80  and  sample  size  N  =  1000.  The  SNR  is  high,  so  the  optimal  mapping  function  is 
sigmoidal  in  shape  where  p£(x)  is  nonnegligible;  few  hidden  nodes  are  needed  to  model  this  optimal 
mapping  function.  Note  that  despite  the  increase  in  the  average  MSE  as  the  hidden  units  increases, 
the  magnitude  of  this  difference  is  still  small  when  compared  to  Fig.  7.  The  error  bars  represent  ±2 
standard  errors  of  each  mean  (s).  The  density  functions  used  in  this  simulation  study  had  parameters 
a  =  4 /yfD  and  b  =  0.5. 
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Fig.  10.  The  effect  of  the  signal-to-noise  ratio  SNR  on  the  accuracy  of  Bayesian  ANNs  with  fixed  H  =  4 
and  N  =  200.  Increasing  SNR  causes  a  decrease  in  the  average  MSE  between  the  Optimal  mapping 
function  and  the  Bayesian  ANN  model  of  that  function.  The  error  bars  represent  ±2  standard  errors 
of  each  mean  (s).  The  density  functions  used  in  this  simulation  study  had  a  fixed  b  =  0.5. 


November  18,  1999 


Fig.  11.  The  effect  of  the  signal-to-noise  ratio  SNR on  the  accuracy  of  Bayesian  ANNs  with  fixed 
H  —  10  and  N  —  1000.  The  average  MSE  remains  relatively  constant  as  SNR  increases  but  falls 
after  a  certain  point.  The  error  bars  represent  ±2  standard  errors  of  each  mean  (s).  The  density 
functions  used  in  this  simulation  study  had  a  fixed  b  —  0.5. 
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Fig.  12.  The  effect  of  sample  size  N  on  the  accuracy  of  Bayesian  ANNs  with  D  =  3,  SNR  =  1.26  and 
varying  H.  Increasing  the  training  dataset  size  results  in  a  better  approximation  of  pt{i ra|x)-  With 
fewer  hidden  units  {H  =  2),  the  average  MSE  flattens  out  quickly  as  N  increases.  When  more  hidden 
units  are  used,  the  curves  (H  =  4  and  H  =  6)  asymptotically  approach  a  much  smaller  average  MSE. 
The  error  bars  represent  ±2  standard  errors  of  each  mean  (s).  The  density  functions  used  in  this 
simulation  study  had  a  fixed  a  =  1/ VD  and  b  =  0.5. 


November  18,  1999 


IEEE  TMI  (CONFIDENTIAL) 


37 


(a)  (b) 

Fig.  13.  The  training  and  testing  dataset  ROC  curves  generated  using  D  =  5,  H  =  10,  SNR  =  1.26 
and  (a)  N  =  160  or  (b)  N  =  1000.  When  little  training  data  is  used  with  many  hidden  units,  the 
Bayesian  ANN  tends  to  overtrain,  resulting  in  a  large  training  dataset  ROC  curve  and  a  much  lower 
testing  dataset  ROC  curve.  When  more  samples  are  used,  the  training  and  testing  dataset  ROC 
curves  are  similar  to  the  ideal  observer  ROC  curve.  The  density  functions  used  in  this  simulation 
study  had  a  fixed  a  =  1/%/D  and  b  =  0.5.  The  average  MSE  between  the  optimal  mapping  function 
and  the  Bayesian  ANN  estimates  of  that  function  were  0.103  for  (a)  and  0.005  for  (b). 
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Abstract — It  is  well  understood  that  binary  classifiers  have 
two  implicit  objective  functions  (sensitivity  and  specificity) 
describing  their  performance.  Traditional  methods  of  clas¬ 
sifier  training  attempt  to  combine  these  two  objective  func¬ 
tions  (or  two  analogous  class  performance  measures)  into 
one,  so  that  conventional  scalar  optimization  techniques  can 
be  utilized.  This  involves  incorporating  a  priori  information 
into  the  aggregation  method  so  that  the  resulting  perfor¬ 
mance  of  the  classifier  is  satisfactory  for  the  task  at  hand. 
We  have  investigated  the  use  of  a  niched  Pareto  multiob¬ 
jective  genetic  algorithm  for  classifier  optimization.  With 
niched  Pareto  genetic  algorithms,  an  objective  vector  is  op¬ 
timized  instead  of  a  scalar  function,  eliminating  the  need 
to  aggregate  classification  objective  functions.  The  niched 
Pareto  genetic  algorithm  returns  a  set  of  optimal  solutions 
that  are  equivalent  in  the  absence  of  any  information  regard¬ 
ing  the  preferences  of  the  objectives.  The  a  priori  knowl¬ 
edge  that  was  used  for  aggregating  the  objective  function- 
s  in  conventional  classifier  training  can  instead  be  applied 
post-optimization  to  select  from  one  of  the  series  of  solutions 
returned  from  the  multi  objective  genetic  optimization.  We 
have  applied  this  technique  to  train  a  linear  classifier  and  an 
artificial  neural  network  using  simulated  datasets.  The  per¬ 
formances  of  the  solutions  returned  from  the  multiobjective 
genetic  optimization  represent  a  series  of  optimal  (sensitiv¬ 
ity,  specificity)  pairs,  which  can  be  thought  of  as  operating 
points  on  an  ROC  curve.  All  possible  ROC  curves  for  a  giv¬ 
en  dataset  and  classifier  are  less  than  or  equal  to  the  ROC 
curve  generated  by  the  niched  Pareto  genetic  optimization. 

Keywords —  Multiobjective  optimization,  genetic  algo¬ 
rithms,  diagnostic  classifiers,  ROC  analysis. 

I.  Introduction 

The  task  in  medical  diagnostic  decision  making  is  typ¬ 
ically  one 'of  employing  multiple  features  to  classify 
an  observation  as  normal  or  abnormal.  A  radiologist  may, 
for  example,  note  the  size,  shape  and  margin  sharpness 
of  a  potential  breast  lesion  in  a  mammogram  and  some¬ 
how  use  this  information  to  determine  whether  a  cancer 
is  present.  In  computer-aided  diagnosis  (CAD)  [1-3],  com¬ 
puters  take  features  extracted  from  medical  images  and  de¬ 
termine  whether  pathology  is  present  by  using  automated 
classifiers  [4,5].  It  is  well  known  that  the  optimal  method 
for  classifying  would  be  to  use  the  likelihood  ratio  or  any 
monotonic  transformation  of  the  likelihood  ratio  as  the  dis¬ 
criminant  function  [4].  The  goal  in  training  a  diagnostic 
classifier  is  to  employ  a  limited  dataset  to  determine  the 
parameters  of  the  classifier  such  that  it  approximates  the 
likelihood  ratio  decision  rule.  For  the  most  part,  these  clas- 
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sifiers  work  in  a  Similar  fashion.  A  dataset  of  features  ex¬ 
tracted  from  both  normal  (without  disease)  and  abnormal 
(with  disease)  images  is  used  for  determining  the  classifier 
parameter  values,  or  for  “training”  the  classifier,  so  that  it 
correctly  classifies  future  datasets  of  unknown  pathology. 

Classifier  training  can  be  viewed  as  an  optimization  prob¬ 
lem  where  the  quantity  to  be  maximized  is  the  performance 
of  the  classifier  on  an  independent  dataset.  There  are,  how¬ 
ever,  numerous  problems  with  representing  classifier  per¬ 
formance  by  a  single  (scalar)  objective  function,  which  is 
needed  so  that  one  can  use  a  scalar  optimizer  [6,7].  Bi¬ 
nary  classifiers  [4]  have,  in  essence,  two  implicit  objective 
functions:  one  describing  how  well  they  classify  the  abnor¬ 
mal  cases  (sensitivity)  and  one  describing  how  well  they 
classify  the  normal  cases  (specificity).  These  two  objective 
functions  are  non-commensurable,  implying  that  it  may  not 
be  possible  to  simultaneously  improve  both  the  sensitivity 
and  specificity.  Traditional  methods  of  classifier  training 
attempt  to -combine  these  two  objective  functions  (or  two 
analogous  class  performance  measures)  into  a  single  scalar 
objective  function  that  permits  the  use  of  conventional  (s- 
calar)  optimization  techniques  [8] .  A  drawback  of  this  ap¬ 
proach  is  that  the  proper  way  of  aggregating  the  objective 
functions  is  usually  unknown.  There  are,  in  fact,  an  infi¬ 
nite  number  of  ways  of  mapping  two  objective  functions  to 
a  single  scalar  function.  Even  when  a  priori  information 
about  the  relative  importance  of  the  two  objective  func¬ 
tions  is  available,  it  is  not  always  clear  how  to  incorporate 
it  in  the  aggregating  approach  to  objective  function  de¬ 
sign.  Sometimes  numerous  ad  hoc  combination  functions 
are  tried  until  a  suitable  objective  function  is  found  [8]. 
Most  classifiers  do  not  aggregate  sensitivity  and  specificity 
directly.  Artificial  neural  networks,  for  example,  typically 
employ  a  sum-of-squares  error  function  [5]  which  can  still 
be  thought  of  as  a  sum  of  two  non-commensurable  objec¬ 
tives,  i.e.,  one  objective  is  to  map  abnormal  observations  to 
a  value  close  to  1  and  the  other  objective  is  to  map  normal 
observations  to  a  value  close  to  0. 

Genetic  algorithms  (GAs)  [9]  have  been  applied  to  many 
diagnostic  and  classification  problems  [8,10-15].  A  conven¬ 
tional  GA,  however,  is  a  scalar  optimization  technique.  It 
thus  has  the  undesirable  features  of  an  aggregating-based 
approach.  One  method  of  avoiding  this  is  to  adopt  a  multi¬ 
objective  approach  [16,17]  to  the  optimization  problem.  In 
a  multiobjective  optimization  approach,  the  objective  func¬ 
tion  is  vector- valued  and  the  independent  objectives  (sensi¬ 
tivity  and  specificity)  are  optimized  simultaneously.  Thus, 
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the  need  to  aggregate  the  independent  objective  functions 
is  removed.  Unlike  a  scalar  optimization  that  returns  a  sin¬ 
gle  solution,  the  solution  to  the  multiobjective  optimization 
problem  is  a  set  of  solutions  called  the  Pareto-optimal  set. 
The  Pareto-optimal  set  is  defined  as  the  set  of  solutions 
for  which  no  other  solution  exists  that  is  better  in  both 
objectives.  In  the  absence  of  any  preference  information 
about  the  objectives,  the  members  of  the  Pareto-optimal 
set  are  equally  valid  solutions  to  the  optimization  prol> 
lem;  no  other  solutions  exist  that  are  better  in  all  of  the 
objectives.  In  the  context  of  diagnostic  classifier  optimiza¬ 
tion,  the  members  of  the  Pareto-optimal  set  correspond  to 
operating  points  on  an  optimal  ROC  curve,  whose  perfor¬ 
mances  describe  the  limiting  sensitivity-specificity  trade¬ 
offs  that  the  classifier  can  provide  for  the  given  training 
dataset.  Conventional  non-evolutionary  optimization  tech¬ 
niques  have  not  been  successfully  extended  to  the  multi¬ 
objective  case  because  they  are  not  designed  to  operate 
on  multiple  solutions.  Because  GAs  are  population-based, 
they  have  formed  the  basis  of  several  multiobjective  opti¬ 
mization  techniques,  collectively  referred  to  as  multiobjec¬ 
tive  GAs  (MGGAs)  [16-19]. 

In  this  paper,  we  investigate  the  application  of  a  MOGA 
called  a  niched  Pareto  GA  (NP-GA)  for  optimizing  the  per¬ 
formance  of  two  popular  diagnostic  classifiers.  The  paper 
is  organized  as  follows.  Section  II  contains  a  general  intro¬ 
duction  to  automated  classifiers  and  a  brief  description  of 
the  NP-GA.  Section  III  describes  the  two  classifiers  that 
were  studied,  and  it  describes  how  the  NP-GA  was  em¬ 
ployed  to  train  them.  The  results  of  the  two  optimizations 
are  presented  in  Section  IV.  Sections  V  and  VI  contain  a 
discussion  of  the  results  and  a  summary  of  the  advantages 
and  drawbacks  of  the  proposed  approach  to  diagnostic  clas¬ 
sifier  training  and  ROC  curve  generation. 

II.  Background 
A.  Automated  Classifiers 

An  automated  binary  classifier  separates  two  classes  of 
observations  (e.g.  images)  and  assigns  new  observations 
to  one  of  the  two  classes.  In  this  paper,  we  will  label  the 
two  classes  as  normal  (no  disease  evident)  and  abnormal 
(indicative  of  disease),  denoted  by  7rn  and  7ra,  respectively. 
Certain  characteristics  of  the  observations,  called  features, 
are  used  in  making  the  classification  decision.  The  set  of 
features  corresponding  to  an  observation  can  be  expressed 
by  a  vector  x  =  [xi,x2, . .  .  ,xp].  In  order  for  the  classifier 
to  be  “trained,”  we  start  with  a  dataset  of  known  patholo¬ 
gy  called  the  training  dataset.  A  graphical  depiction  of  an 
automated  classifier  for  a  two  feature  example  (p  =  2)  is 
shown  in  Fig.  1.  The  {xi,  x2}  space  spanned  by  the  feature 
vectors  is  denoted  by  S.  An  automated  classifier  uses  a  pa¬ 
rameter  vector  w  to  partition  this  space  into  the  sets  Cn( w), 
the  set  of  observations  that  belong  to  class  7rn,  and  Ca(uJ), 
the  set  of  observations  belonging  to  class  7ra.  The  parame¬ 
ters  ti;  of  a  classifier  can  represent,  for  example,  the  weights 
of  an  artificial  neural  network  or  the  threshold  values  in  a 
rule-based  classifier.  For  a  fixed  w,  Cn{w)uCa(w)  =  <S,  and 
Cn(w)nCa{w)  =  0. 


Fig.  1.  The  job  of  an  automated  classifier  is  to  partition  the  multi¬ 
dimensional  feature  space  into  two  partitions,  Co(vj)  belonging 
to  class  7ra  and  Cn(™)  belonging  to  class  7Tn.  These  partitions, 
Cn(ft7)  and  Ca(«7),  are  shown  by  the  shaded  and  unshaded  regions. 
The  two  classes,  ra  and  7rn,  are  represented  by  different  symbols 
(x’s  and  o’s).  The  decision  boundary  is  denoted  by  the  solid  line 
separating  the  shaded  from  the  unshaded  region. 

Given  a  measurement  x,  the  classifier  assigns  x  to  class 
7rn  if  x  6  Cn(w)  or  to  class  ?ra  if  x  6  Ca(w).  The  probability 
that  an  observation  belonging  to  class  7ra  is  correctly  classi¬ 
fied  is  referred  to  as  the  sensitivity  of  the  classifier,  denoted 
by  Sens(w).  Similarly,  the  probability  that  an  observation 
is  correctly  classified  as  belonging  to  class  7rn  is  referred 
to  as  the  specificity  of  the  classifier,  denoted  by  Spec(w). 
Note  that  both  the  sensitivity  and  specificity  of  the  clas¬ 
sifier  depend  explicitly  on  the  choice  of  w  and  implicitly 
on  the  underlying  distribution  of  the  normal  and  abnormal 
observations.  The  sensitivity  is  a  measure  of  how  well  the 
classifier  performs  on  abnormal  cases,  whereas  the  specifici¬ 
ty  is  a  measure  of  how  well  a  classifier  performs  on  normal 
cases.  In  practice,  the  fraction  of  class  7ra  observations  that 
are  correctly  classified  is  used  as  an  estimate  of  Sens(w). 
Likewise,  the  fraction  of  class  na  observations  that  are  cor¬ 
rectly  classified  is  used  as  an  estimate  of  Spec(w). 

A  popular  construct  used  for  describing  the  performance 
of  a  diagnostic  classifier  is  the  receiver  operating  character¬ 
istic  (ROC)  curve  [6,7,20,21].  An  ROC  curve  is  generated 
by  varying  the  value  of  one  (or  more)  of  the  components 
of  the  parameter  vector  w ,  and  plotting  the  correspond¬ 
ing  Sens(w)  and  Spec(w)  values.  For  example,  the  output 
threshold  is  usually  varied  to  generate  an  ROC  curve  for 
artificial  neural  networks  [22].  Traditionally,  the  classifier 
is  trained  prior  to  the  generation  of  the  ROC  curve  [22,23]. 
In  this  situation,  all  but  one  point  on  the  ROC  curve  rep¬ 
resent  operating  points  other  than  the  one  to  which  the 
classifier  was  naturally  trained.  An  ROC  curve  that  was 
generated  with  the  same  dataset  that  was  used  to  train 
the  classifier  is  referred  to  as  a  “consistency”  ROC  curve. 
A  “validation”  ROC  curve  is  obtained  when  the  curve  is 
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Fig.  2.  The  two  ROC  curves  have  equal  Az  values,  but,  depending 
upon  the  relative  preferences  concerning  the  sensitivity  or  speci¬ 
ficity  of  the  detection  task,  one  curve  is  typically  preferred  over 
the  other. 

generated  from  an  independent  data  set,  and  represents  an 
unbiased  estimate  of  classifier  performance  [24].  Two  typ¬ 
ical  ROC  curves  are  shown  in  Fig.  2.  The  area  under  an 
ROC  curve,  or  Az ,  is  an  accepted  way  of  comparing  overall 
classifier  performance  [6,7,20,21].  Two  curves  may  have 
equal  Az  values,  as  shown  in  Fig.  2;  however,  one  of  the 
curves  will  typically  be  preferred  over  the  other,  depend¬ 
ing  upon  the  relative  preference  of  the  sensitivity  and  the 
specificity  needed  for  the  task  at  hand. 

For  certain  types  of  classifiers,  such  as  rule-based  systems 
[3,25],  it  may  not  be  clear  how  w  should  be  varied  to  sweep 
out  the  ROC  curve  that  best  represents  the  sensitivity- 
specificity  tradeoffs  that  are  achievable  by  the  classifier  on 
the  specified  dataset.  The  ROC  curves  generated  by  vary¬ 
ing  different  sets  of  components  of  w  will  generally  be  d- 
ifferent.  representing  different  sensitivity-specificity  trade¬ 
offs  that  are  possible.  In  this  work,  we  demonstrate  that 
this  ambiguity  can  be  removed  if  one  uses  the  performances 
of  the  solutions  returned  by  a  multiobjective  optimization 
of  the  classifier  to  define  the  ROC  curve. 

B.  The  Niched  Pareto  GA 

We  have  implemented  a  multiobjective  optimization 
technique  called  a  Niched  Pareto  GA  (NP-GA),  which  is 
described  in  detail  by  Horn  et  al  [26].  Other  types  of  MO- 
GAs  have  been  proposed  and  are  described  in  reference 
[18].  The  NP-GA  can  be  viewed  as  a  conventional  (scalar) 
GA  that  uses  a  modified  tournament  selection  mechanis- 
m  and  ranking  scheme.  Readers  not  familiar  with  genetic 
algorithms  may  consult  reference  [9].  In  the  remainder  of 
this  section  we  review  the  NP-GA  proposed  by  Horn  et  al. 
[26]. 

In  order  to  directly  address  the  multiobjective  nature 
of  the  optimization  problem,  NP-GAs  employ  the  concept 
of  dominance.  A  solution  to  the  optimization  problem  is 
called  non-dominated  if  there  is  no  solution  superior  to  it 
in  all  objectives.  It  is  the  goal  of  the  NP-GA  to  discover 


the  set  of  all  non-dominated  solutions,  referred  to  as  the 
Pareto- optimal  set ,  all  of  which  are  considered  to  be  equal¬ 
ly  valid  solutions  to  the  problem  in  the  absence  of  any  a 
prion  information  about  the  relative  merits  of  the  different 
objectives.  If  a  solution  is  not  non-dominated,  it  is  referred 
to  as  being  dominated.  A  non-dominated  solution  is  said 
to  dominate  a  dominated  solution.  Equivalence  classes  of 
dominated  solutions  are  formed  by  grouping  them  accord¬ 
ing  to  the  number  of  solutions  that  dominate  them. 

This  grouping  of  solutions  into  distinct  classes  establish¬ 
es  a  partial  order  on  the  set  of  all  solutions  that  is  used 
to  determine  rank.  We  assume  that  the  Pareto-optimal  set 
corresponds  to  equivalence  class  0,  and  that  all  other  so¬ 
lutions  have  an  equivalence  class  greater  than  zero.  The 
rank  of  a  particular  solution  is  then  equal  to  its  equiva¬ 
lence  class  number.  This  ensures  that  solutions  within  the 
same  equivalence  class  have  the  same  rank,  which  reflects 
the  fact  that  solutions  within  the  same  class  are  equally 
“good”  in  the  absence  of  any  other  information. 

To  perform  selection,  the  NP-GA  uses  a  modified  tour¬ 
nament  selection  method.  In  a  scalar  GA,  tournament  s- 
election  is  one  of  the  methods  commonly  used  for  choos¬ 
ing  a  subset  of  solutions  in  the  current  generation  to  be 
placed  in  the  following  generation.  Implicit  in  its  formu¬ 
lation  is  the  assumption  that  there  exists  a  single  solution 
to  the  optimization  problem;  diversity  among  solutions  in 
the  population  will  be  lost  after  a  certain  number  of  gen¬ 
erations.  This  is  undesirable  in  a  multiobjective  optimiza¬ 
tion  where  we  wish  to  discover  all  of  the  members  of  the 
Pareto-optimal  set,  not  simply  a  single  solution.  To  cir¬ 
cumvent  this  difficulty,  Horn  et  al.  proposed  the  use  of  a 
Pareto  domination  tournament  in  conjunction  with  a  form 
of  fitness  sharing  called  equivalence  class  sharing.  A  Pareto 
domination  tournament  is  a  modified  conventional  tourna¬ 
ment  selection  method  that  uses  the  concept  of  dominance 
to  determine  the  winner  of  the  tournament.  First,  tdom 
randomly  selected  solutions  are  compared,  and  the  solution 
with  the  highest  rank  wins  (is  carried  over  to  next  genera¬ 
tion).  The  rank,  being  based  on  the  concept  of  dominance, 
incorporates  the  multiobjective  nature  of  the  problem  in¬ 
to  the  selection  mechanism.  For  situations  when  a  certain 
tournament  size  provides  insufficient  domination  pressure, 
the  size  of  the  tournament  (tdom)  can  be  increased. 

When  two  or  more  solutions  in  a  tournament  belong  to 
the  same  equivalence  class  (i.e.,  have  the  same  rank),  there 
will  not  be  a  clear  winner.  A  winner  cannot  simply  be  cho¬ 
sen  at  random,  because  genetic  drift  will  cause  the  popula¬ 
tion  to  converge  to  a  localized  region  of  the  Pareto-optimal 
set,  thus  obscuring  other  potential  solutions  to  the  opti¬ 
mization  problem.  Instead,  a  form  of  fitness  sharing  called 
equivalence  class  sharing  is  employed  to  determine  the  win¬ 
ner  of  a  tied  tournament.  In  equivalence  class  sharing,  the 
winner  of  a  tied  tournament  is  the  solution  that  has  the 
smallest  niche  count .  The  niche  count  estimates  the  densi¬ 
ty  of  solutions  in  a  localized  region  (niche)  around  a  given 
solution.  As  described  in  reference,  the  niche  count  m{  for 
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the  zth  solution  is  given  by 

771  j  =  ^  S(dij).  (1) 

jePop 

where  d{j  is  the  distance  (in  objective  space)  between  so¬ 
lutions  i  and  j,  and  s()  is  the  so  called  sharing  function 
given  by  s(d)  =  1  —  d/o’share  f°r  d  <  &  share  and  s(d)  —  ^ 
otherwise.  Here,  < 7 share  is  called  the  niche  radius ,  which 
represents  the  maximum  distance  between  solutions  that 
will  result  in  an  increase  in  their  niche  counts.  By  em¬ 
ploying  fitness  sharing  in  this  way,  the  Pareto-optimal  set 
is  more  likely  to  be  uniformly  sampled,  thus  providing  a 
more  diverse  set  of  potential  solutions  to  the  optimization 
problem  from  which  the  user  can  choose. 

III.  Methods 

We  trained  a  linear  classifier  and  an  artificial  neural  net¬ 
work  by  using  both  conventional  optimization  techniques 
and  the  NP-GA.  Two-dimensional  exclusive-OR  data  [27], 
sampled  from  the  density  functions  shown  in  Fig.  3,  were 
used  for  this  study  because  classifiers  typically  have  diffi¬ 
culty  in  adequately  classifying  both  classes  of  data  for  this 
problem.  Two-dimensional  isotropic  standard  normal  dis¬ 
tributions  with  mean  {fJLXltfJLX2)  and  variance  1  were  sam¬ 
pled  in  the  four  regions  of  the  exclusive-OR  problem.  The 
normal  class  (dashed  lines  in  Fig.  3)  occupied  the  regions 
centered  at  (1.3, 1.3)  and  (-1.3,  -1.3).  The  abnormal  class 
occupied  the  regions  centered  at  (1.3,  -1.3)  and  (—1.3, 1*3). 
A  total  of  100  normal  and  100  abnormal  samples  were  gen¬ 
erated  for  training  data.  An  additional  10, 000  normal  and 
10.000  abnormal  samples  were  generated  for  testing  the 
classifiers  after  they  had  been  trained.  The  performances 
of  the  conventionally  optimized  and  NP-GA  optimized  clas¬ 
sifiers  were  evaluated  on  both  the  training  and  the  testing 
datasets. 

A.  NP-GA  Implementation 

The  NP-GA  was  employed  to  simultaneously  maximize 
the  sensitivity  and  specificity  of  a  linear  classifier  and  an 
ANN  with  a  single  hidden  layer.  The  value  of  each  com¬ 
ponent  of  w  was  restricted  to  remain  within  a  maximum 
and  minimum  value,  determined  prior  to  the  optimization. 
A  binary  representation  of  the  chromosomes  [9]  was  uti¬ 
lized  so  that  each  real-valued  parameter  in  w  was  encoded 
by  a  binary  number  of  fixed  length.  The  range  of  each 
component  of  w  and  the  length  of  its  binary  representation 
determined  that  parameter’s  floating-point  precision.  The 
encoding  was  accomplished  by  linearly  scaling  the  float¬ 
ing  point  number  using  its  specified  range  to  an  integer 
between  0  and  2n  -  1  where  n  is  the  number  of  bits.  S- 
tandard  single-point  crossover  and  standard  mutation  were 
employed  as  the  genetic  operations  [9].  The  rates  of  the  ge¬ 
netic  operations  were  determined  empirically  by  perform¬ 
ing  multiple  optimizations.  A  crossover  rate  of  30%  and  a 
mutation  rate  of  5%  were  found  suitable  for  the  problems 
studied.  A  tdom  value  of  4  and  a  a  share  value  of  0.1  (or  10% 
of  the  range  of  each  objective)  were  also  found  to  work  well 


Fig.  3.  Contour  diagrams  of  the  two  density  functions  that  make 
up  the  exclusive-OR  problem.  The  abnormal  class  (solid  lines) 
occupies  the  upper-left  and  lower-right  quadrants,  whereas  the 
normal  class  (dashed  lines)  occupies  the  upper-right  and  lower- 
left  quadrants. 

for  the  optimization  problems  discussed  in  this  paper.  A 
discussion  of  these  parameter  settings  is  presented  later. 

B.  Classifiers 

A  linear  classifier  attempts  to  separate  the  two  classes 
of  observations  by  using  a  linear  decision  boundary.  We 
employed  logistic  discriminants  [5]  in  order  to  implement 
this  classification.  A  logistic  discriminant  projects  the  data 
onto  a  decision  variable,  and  then  a  threshold  is  applied 
for  determining  whether  a  given  observation  belongs  to  7ra 
or  t rn.  The  abnormal  set  for  a  logistic  discriminant  with 
parameter  vector  w  is  defined  as 

Ca{w)  =  {x  :  g(xfwT)  >  0.5},  (2) 

where  x!  =  *  * .  ,zP,  —  1]  =  [^  —  1]  and  p()  a  S*S" 

moidal  function  with  output  bound  between  0  and  1  [5]. 
The  normal  set  is  defined  as  Cn(w)  —  S  —  Ca(w).  The  con¬ 
ventional  method  for  generating  an  ROC  curve  for  a  logistic 
discriminant  is  to  vary  the  final  parameter  in  the  vector  w, 
which  results  in  a  translation  of  the  decision  boundary. 

The  NP-GA  was  used  to  optimize  the  parameters  of  a 
logistic  discriminant  so  as  to  work  with  the  exclusive-OR 
data  described  previously.  All  3  components  (for  2D  prob¬ 
lems,  w  has  3  components)  of  the  parameter  vector  w  were 
allowed  to  range  between  —3  and  3.  With  a  population 
size  of  500  solutions,  we  ran  the  NP-GA  for  a  total  of  100 
generations.  Conventional  logistic  discriminant  training,  as 
described  in  reference  [5],  was  employed  to  compare  with 
the  NP-GA  results. 

An  artificial  neural  network  (ANN)  is  a  set  of  connected 
nodes  that  is  loosely  based  on  the  human  neuron  system  [5, 
27-30].  For  classification  purposes,  an  ANN  can  be  thought 
of  as  a  mapping  function  that  uses  the  weights  w  to  map 
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an  input  vector  £  to  a  scalar  quantity  to  which  a  threshold 
is  applied  to  determine  whether  x  belongs  to  class  7ra  or 
7rn.  Unlike  logistic  discriminants,  an  ANN  can  separate 
the  two  classes  of  observations  using  a  non-linear  decision 
boundary.  The  abnormal  set  of  observations  for  an  ANN 
using  the  weights  w  is  given  by 

Ca(w)  =  {z  :  h(x;w)  >  0.5},  (3) 

where  h(x ;  w)  represents  the  non-linear  mapping  of  the  in¬ 
put  features  to  the  single  output  value  bound  between  0 
and  1. 

We  applied  the  NP-GA  to  optimize  an  ANN  on  the 
excIusive-OR  data.  A  two-layered  ANN  with  2  inputs,  2 
hidden  units,  and  one  output  unit  was  employed.  This  cor¬ 
responded  to  a  total  of  9  parameters  to  be  optimized.  The 
magnitudes  of  the  weights  were  forced  to  lie  between  -5 
and  5  in  order  to  simplify  the  optimization  task  and  to  reg¬ 
ularize  the  problem  somewhat,  because  large  weight  values 
represent  complex  decision  boundaries  [28].  A  population 
size  of  3000  solutions  was  run  for  a  total  of  100  generations 
for  this  study.  Conventional  error- backpropagation  ANN 
training  [5,27,29,30]  was  also  employed  numerous  times 
using  different  initial  conditions.  A  comparison  of  the  per¬ 
formances  of  the  NP-GA  results  with  the  best  conventional 
results  will  be  shown  along  with  a  comparison  of  the  NP- 
GA  performances  with  a  conventional  optimization  that 
was  trapped  in  a  local  minimum.  The  conventional  ROC 
curves  were  generated  by  varying  the  output  bias  weight 
value,  which  corresponds  to  one  component  of  w.  This  is 
equivalent  to  varying  the  neural  network  output  threshold. 
It  should  be  noted  that  Woods  and  Bowyer  [23]  studied  the 
effect  of  varying  weight  values  other  than  the  output  bias 
weight  in  generating  ROC  curves.  Their  study  conclud¬ 
ed  that  varying  a  subset  of  the  weights  can  produce  better 
ROC  curves  than  the  ROC  curves  produced  by  varying  the 
output  threshold  as  is  conventionally  done.  By  applying 
the  NP-GA  to  ANNs,  however,  we  are  effectively  allowing 
all  the  weights  to  vary  when  generating  the  ROC  curve, 
including  both  the  output  threshold  and  the  hidden  layer 
bias  weights  studied  in  the  Woods  and  Bowyer  work. 

IV.  Results 

A.  Linear  Classifier 

Figure  4  shows  the  performances  of  the  non-dominated 
solutions  returned  by  the  NP-GA  and  the  ROC  curve 
that  resulted  from  the  conventional  training,  generated  by 
thresholding  the  output  value.  The  operating  points  ob¬ 
tained  by  the  NP-GA  are  seen  to  be  better  than  the  corre¬ 
sponding  operating  points  on  the  conventional  ROC  curve 
in  the  high  sensitivity  region.  Figure  5  demonstrates  the 
same  behavior  when  the  NP-GA  solutions  and  the  con¬ 
ventional  solution  are  evaluated  on  the  independent  data 
set.  This  is  evidence  that  the  performance  improvement 
achieved  by  the  NP-GA  training  was  not  simply  a  result 
of  over-training.  However,  because  the  training  data  were 
sparse  between  the  four  regions  of  the  exclusive- OR  data, 
a  few  of  the  solutions  returned  by  the  NP-GA  show  slight 


1-Specificity 

Fig.  4.  Consistency  results  of  the  logistic  discriminant  training  us¬ 
ing  exclusive- OR  training  data.  The  circles  represent  the  perfor¬ 
mances  of  the  non-dominated  solutions  returned  by  the  NP-GA 
based  training.  The  solid  line  is  the  conventional  ROC  curve 
produced  by  varying  the  output  threshold  value  of  the  logistic 
discriminant  after  it  was  trained  using  a  scalar  optimization  tech¬ 
nique.  The  shaded  region  shows  the  performances  achievable  by 
all  possible  weight  vectors  w. 

signs  of  overfitting  when  tested  on  the  20, 000  testing  sam¬ 
ples,  as  is  demonstrated  by  the  fact  that  a  few  solutions  are 
dominated  when  evaluated  on  the  test  set.  The  majority  of 
the  solutions,  however,  do  not  show  signs  of  overtraining. 

The  ROC  curve  for  the  conventionally  trained  logistic 
discriminant  was  generated  by  varying  the  output  thresh¬ 
old  (final  parameter  in  w)  and  plotting  the  corresponding 
sensitivity  and  specificity  values.  Figure  6  shows  the  deci¬ 
sion  boundaries  at  various  output  thresholds  for  the  con¬ 
ventionally  trained  logistic  discriminant.  Decision  bound¬ 
aries  corresponding  to  different  threshold  values  are  seen 
to  be  parallel.  Because  of  this,  the  classifier  only  performs 
well  in  the  low-sensitivity  region.  If,  however,  the  decision 
boundaries  were  rotated  by  90  degrees  to  those  shown  in 
Fig.  6,  the  classifier  would,  instead,  perform  well  in  the 
high  sensitivity  Fegion.  The  advantage  of  the  NP-GA  is 
that,  at  different  ROC  operating  points,  the  orientation  of 
the  decision  boundary  can  be  different.  Thus,  the  NP-GA 
trained  logistic  discriminant  can  perform  optimally  in  both 
the  high  and  low  sensitivity  regions.  This  is  because  with 
the  NP-GA,  all  components  of  w  are  effectively  allowed 
to  vary  when  generating  the  ROC  curve  rather  than  just 
varying  the  value  of  one  of  the  parameters  and  keeping  the 
other  two  fixed. 

B.  Artificial  Neural  Network 

The  performances  of  the  NP-GA  results  on  the  200  train¬ 
ing  samples  is  shown  in  Fig.  7.  The  best  conventional  AN- 
N  optimization  ROC  curve,  created  by  varying  the  output 
threshold,  is  also  shown  in  Fig.  7.  The  NP-GA  result  is 
either  equal  to  or  better  than  the  best  conventional  result 
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Fig.  5.  Validation  results  of  the  logistic  discriminant  training  for 
20.  000  samples  from  the  exclusive-OR  data  distribution  to  eval¬ 
uate  the  performances.  The  circles  represent  the  performances 
of  the  non-dominated  solutions  returned  by  the  NP-GA  based 
training.  The  solid  line  is  the  conventional  ROC  curve  produced 
by  varying  the  output  threshold  value  of  the  logistic  discriminant 
after  it  was  trained  using  a  scalar  optimization  technique. 


Fig.  6.  An  explanation  of  why  the  conventionally-trained  logistic  dis¬ 
criminant  only  performs  well  in  the  low  sensitivity  region.  The 
decision  boundaries  corresponding  to  different  output  threshold 
values  of  the  discriminant  are  shown  superimposed  on  the  data 
distribution.  The  o’s  represent  normal  signals  and  the  x’s  rep¬ 
resent  the  abnormal  signals.  The  abnormal  region  is  to  the  left 
of  each  decision  boundary  and  the  normal  region  is  to  the  right 
of  each  decision  boundary.  When  the  threshold  value  is  varied, 
the  decision  boundary  is  simply  translated  with  its  orientation  re¬ 
maining  fixed.  By  analyzing  the  sensitivities  and  specificities  for 
each  decision  boundary,  one  can  generate  the  conventional  ROC 
curve  shown  in  Fig.  4.  In  order  for  the  classifier  to  perform  well 
in  the  high  sensitivity  region,  the  decision  boundaries  would  have 
to  be  rotated  by  90  degrees  which  would  result  in  the  classifier 
performing  poorly  in  the  low  sensitivity  region. 


Fig.  7.  Consistency  results  of  the  ANN  training  using  exclusive- 
OR  training  data.  The  circles  represent  the  performances  of  the 
non-dominated  solutions  returned  by  the  NP-GA  based  training. 
The  solid  line  is  the  conventional  ROC  curve  produced  by  varying 
the  output  threshold  value  of  the  ANN  after  it  was  trained  us¬ 
ing  a  scalar  optimization  technique.  The  dashed  line  represents 
the  result  of  a  conventionally  trained  ANN  trapped  in  a  local 
minimum.  The  conventional  training  became  trapped  in  local 
minima  in  approximately  30%  of  the  conventional  optimizations 
performed. 

at  all  points.  The  differences  are  small  in  most  regions,  but 
substantial  in  the  very  high  sensitivity  region  of  the  ROC 
curve.  No  regularization  techniques  were  applied  to  the 
conventional  optimization;  therefore,  one  would  typically 
be  concerned  about  over- training.  Figure  8  shows  the  val¬ 
idation  ROC  curves  generated  by  applying  the  optimized 
results  to  the  20, 000  testing  samples.  Again,  the  NP-GA 
result  is  closely  matched  with  the  conventional  result  at 
most  places  in  ROC  space  except  in  the  high  sensitivity 
region  where  the  NP-GA  result  is  noticeably  better  than 
the  conventional  result.  Overtraining  was  not  a  noticeable 
problem  in  both  of  these  optimizations  because  the  struc¬ 
ture  of  the  ANN  was  limited  (2  hidden  nodes)  in  both  runs 
and  the  parameter  range  of  the  NP-GA  was  limited  as  well. 

Local  minima  often  plague  conventional  ANN  optimiza¬ 
tions.  We  found  that,  depending  upon  the  initial  staring 
point,  our  ANN  converged  to  local  minima  about  30%  of 
the  time  as  was  evident  by  comparing  the  ROC  curves  of 
the  different  ANN  optimizations.  The  NP-GA  never  had  a 
problem  with  local  minima.  Figure  7  also  shows  the  per¬ 
formance  of  the  conventional  result  that  resided  in  a  local 
minimum  in  the  parameter  space  (dashed  line).  The  NP- 
GA  result  is  substantially  better  at  almost  all  points  in 
ROC  space. 

C.  NP-GA  Performance 

We  conducted  experiments  to  analyze  the  behavior  of 
the  NP-GA  and  verify  that  our  choice  of  NP-GA  operating 
parameter  settings  was  reasonable.  Figure  9  demonstrates 
the  convergence  of  the  non-dominated  set  when  the  ANN 
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Fig.  8.  Validation  results  of  the  ANN  training  on  20, 000  samples 
from  the  exclusive-OR  data  distribution  to  evaluate  the  per¬ 
formances.  The  circles  represent  the  performances  of  the  non- 
dominated  solutions  returned  by  the  NP-GA  based  training.  The 
solid  line  is  the  conventional  ROC  curve  produced  by  varying  the 
output  threshold  value  of  the  ANN  after  it  was  trained  using  a 
scalar  optimization  technique. 

was  trained  using  the  previously  described  training  data 
and  operating  parameter  settings.  Figures  9(a),  9(b),  9(c), 
and  9(d)  show  the  performances  of  the  non-domi'nated  so¬ 
lutions  evaluated  on  the  training  data  at  generations  2,  5, 
13,  and  100,  respectively.  It  can  be  seen  that  the  loci  of 
operating  points  migrate  upward  and  to  the  left  as  the  gen¬ 
eration  number  increases.  Beyond  100  generations,  the  loci 
of  operat  ing  points  remain  approximately  constant,  demon¬ 
strating  that  the  NP-GA  had  converged  to  a  stable  set  of 
solutions.  It  should  also  be  noted  that  the  relatively  high 
density  of  the  operating  points  returned  by  the  NP-GA 
indicates  that  the  non-dominated  set  of  solutions  was  ade¬ 
quately  sampled. 

Although  the  data  described  above  demonstrate  that  the 
NP-GA  converged  when  training  the  ANN,  we  do  not  know 
whether  the  final  set  of  solutions  represents  the  best  possi¬ 
ble  set  of  solutions  (i.e.,  the  Pareto-optimal  set).  To  verify 
this,  one  would  have  to  evaluate  the  performances  of  all 
the  possible  combinations  of  parameter  values  of  the  ANN, 
which  is  not  a  computationally  tractable  problem  with  cur¬ 
rent  computer  technology.  We  can,  however,  compute  this 
for  the  linear  classifier  because  it  possesses  only  3  free  pa¬ 
rameters.  The  shaded  region  in  Fig.  4  shows  the  operating 
points  achievable  by  all  possible  parameter  settings  for  the 
linear  classifier.  Because  most  of  the  operating  points  re¬ 
turned  by  the  NP-GA  lie  on  the  upper-left  boundary  of  the 
shaded  region,  we  can  conclude  that,  for  this  example,  the 
NP-GA  was  successful  at  converging  to  the  Pareto-optimal 
set. 

As  was  noticed  in  reference  [19],  we  observed  that  the 
size  of  the  Pareto  dominant  tournament  (tdom)  significant¬ 
ly  affected  the  convergence  behavior  of  the  NP-GA.  Figure 


10  shows  the  operating  points  returned  by  two  separate  ap¬ 
plications  of  the  NP-GA  to  the  ANN  training.  The  upper 
set  of  solutions,  discussed  previously,  was  obtained  with 
tdom  =  4-  The  lower  set  of  solutions  was  obtained  using 
the  same  NP-GA  operating  settings  except  with  tdom  =  2. 
With  tdom  =  2,  the  NP-GA  returned  a  set  of  solutions  that 
were  clearly  subopt imal.  One  explanation  of  this  result 
is  the  following:  When  a  tournament  selection  scheme  is 
used,  there  is  a  non-zero  probability  that  a  solution  in  a 
given  population  will  not  be  selected  to  compete  in  any  of 
the  tournaments.  This  can  result  in  a  potentially  “good’ 
solution  being  lost  by  the  NP-GA.  The  probability  of  los¬ 
ing  a  solution  in  this  way  is  equal  to  (^p)tdomiV\  where 
N  is  the  population  size.  When  N  is  large,  this  probabil¬ 
ity  converges  to  e~tdom.  For  tdom  —  2,  this  corresponds 
to  a  probability  of  0.135  of  losing  a  solution  in  any  given 
population.  When  td0m  =  4,  this  probability  is  reduced  to 
0.018.  By  increasing  the  size  of  the  tournament,  we  reduce 
the  probability  of  losing  a  potentially  good  solution  which 
could  contribute  to  inadequate  convergence  of  the  NP-GA. 

There  are  problems,  however,  with  using  too  large  a  tour¬ 
nament  size.  When  we  used  large  values  of  tdom  (f°r  ex" 
ample,  tdom  >  20),  the  NP-GA  converged  to  a  solution 
similar  to  that  achieved  for  tdom  =  4,  but  subsequently 
fluctuated  about  that  solution  as  a  function  of  generation 
number.  This  instability  is  a  result  of  having  domination 
tournaments  in  which  multiple  non-dominated  solutions 
are  forced  to  compete.  When  non-dominated  solutions  are 
forced  to  compete  in  multiple  tournaments,  one  or  more  of 
the  members  of  the  non-dominated  set  will  inevitably  be 
lost.  (The  niche  count  determines  the  winner  of  a  tied  tour¬ 
nament.)  The  observed  instability  of  the  non-dominated 
set  is  a  result  of  losing  and  re-gaining  non-dominated  so¬ 
lutions.  When  large  values  of  tdom  are  used,  the  value  of 
the  niche  size  {a share)  becomes  increasingly  important,  be¬ 
cause  multiple  tied  tournaments  may  arise.  For  td0m  =  4, 
we  found  that  the  NP-GA  performance  was  relatively  in¬ 
sensitive  to  the  value  of  a  share- 

V.  Discussion 

Genetic  algorithm  parameters  are  difficult  to  determine, 
and  few  methods  exist  to  systematically  set  the  GA  pa¬ 
rameters.  The  total  number  of  generations,  the  number  of 
solutions  in  each  generation,  the  crossover  rate  and  the 
mutation  rate  were  determined  experimentally.  Various 
GA  parameter  combinations  were  tested  and  the  results 
were  compared.  We  found  a  set  of  parameters  for  which 
the  results  were  consistent  in  the  sense  that  multiple  opti¬ 
mizations  gave  solutions  with  similar  performances.  If  the 
sets  returned  by  different  NP-GA  runs  were  not  optimal, 
one  would  expect  that  multiple  NP-GA  runs  would  return 
sets  with  either  better  or  poorer  performances.  We  also 
attempted  to  use  various  cr share  values  and  found  that  the 
NP-GA  results  were  robust  with  respect  to  crshare- 

The  NP-GA  exhibits  several  advantages  over  convention¬ 
al  classifier  training  techniques.  One  advantage  is  that 
the  objective  function  describing  the  optimization  task  is 
a  vector-valued  function.  This  eliminates  completely  the 
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Fig  9  Convergence  of  the  NP-GA  for  the  ANN  training  example  as  described  in  the  text.  Subfigures  (a),  (b),  (c),  and  (d)  show  the 
pcrformancL  of  the  non-dominated  solutions  at  generation  numbers  2,  5,  13,  and  100,  respectively.  As  the  general, on  number  mcreases, 
the  loci  of  operating  points  migrate  upward  and  to  the  left. 


need  to  aggregate  the  different  objectives  (sensitivity,  speci¬ 
ficity)  into  a  single  scalar  function.  Rather,  a  priori  in¬ 
formation  about  the  relative  preferences  of  the  objectives 
can  be  used  post-optimization  to  choose  a  member  of  the 
Pareto-optimal  set  as  the  ultimate  solution  to  the  problem. 

Another  advantage  is  that  a  set  of  non-dominated  solu¬ 
tions  is  returned,  rather  than  a  single  solution.  This  allows 
one  to  select  the  solution  (ROC  operating  point)  whose 
performance  is  most  clinically  appropriate  for  the  diagnos¬ 
tic  task  at  hand.  Conventional  classifier  optimizations  can 
return  a  series  of  solutions  in  the  form  of  an  ROC  curve 
obtained  by  varying  certain  components  of  w  after  the  clas¬ 
sifier  has  been  trained.  If  the  scalar  cost  function  employed 
is  an  aggregation  of  sensitivity  and  specificity  directly  then 
only  one  point  in  ROC  space  is  guaranteed  to  be  optimal. 
If  the  scalar  cost  function  is  an  aggregation  of  two  differen- 
t  performance  measures  (such  as  the  sum-of-squares  error 
function  for  ANNs)  then  no  point  is  guaranteed  to  be  opti¬ 
mal  in  ROC  space.  The  NP-GA  circumvents  this  problem 
by  allowing  all  parameters  in  w  to  effectively  vary  in  an  op¬ 


timal  manner  when  sweeping  out  the  ROC  curve.  In  this 
sense,  the  consistency  ROC  curve  returned  by  the  NP-GA, 
assuming  that  the  optimization  is  complete,  is  optimal  at 
every  point.  All  other  possible  performances  for  the  same 
classifier  and  dataset  are  either  equal  to  or  less  than  the 
ROC  curve  returned  by  the  NP-GA  optimization.  Training 
the  classifier  to  operate  at  a  particular  operating  point  and 
then  varying  a  subset  of  the  parameters  in  a  predetermined 
way  to  generate  the  ROC  curve  does  not  ensure  this. 

As  we  have  alluded  to  earlier,  conventional  methods  of 
classifier  optimization  can,  in  fact,  produce  the  Pareto- 
optimal  operating  points  through  multiple  runs  of  the  s- 
calar  optimization  procedure  with  different  weighting  fac¬ 
tors  on  sensitivity  and  specificity  (see  the  Appendix  for  a 
more  detailed  discussion  of  this).  Sensitivity  and  specifici¬ 
ty  are,  however,  discrete  counting  statistics  and  hence  are 
not  differentiable  functions  of  w.  Conventional  gradient- 
based  optimization  methods  such  as  backpropagation  can¬ 
not  be  employed  in  this  situation.  One  is  therefore  left  with 
running  multiple  scalar  stochastic  optimizations  (such  as 
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Fig.  10.  Effect  of  tdom  on  convergence  of  the  NP-GA.  At  a  tdom 
value  of  two,  the  NP-GA  converged  prematurely  because  of  the 
lack  of  domination  pressure.  For  the  problems  studied  in  this 
paper,  a  tdom  value  of  four  resulted  in  reliable  convergence  of 
the  NP-GA.  Large  values  of  tdom  caused  the  non-dominated  set 
to  fluctuate  randomly. 

GAs  or  simulated  annealing)  to  produce  the  same  operat¬ 
ing  points  that  were  produced  with  one  single  run  of  the 
NP-GA.  It  is  also  not  always  clear  how  to  set  the  relative 
weightings  to  evenly  sample  the  Pareto-optimal  set  using  a 
scalar  optimization  technique.  Another  option  would  be  to 
run  multiple  optimizations  using  a  conventional  cost  func¬ 
tion  such  as  the  sum-of-squares  cost  function  with  differ¬ 
ent  weightings  on  the  two  objectives.  No  point,  however, 
is  guaranteed  to  be  a  member  of  the  Pareto-optimal  set  if 
this  type  of  error  function  is  employed.  By  using  an  NP- 
GA  to  train  pattern  classifiers,  we  are  directly  addressing 
the  multiobjective  nature  of  classification  problem. 

If  the  density  functions  of  the  normal  and  abnormal 
classes  (/n(£)  and  fa( £),  respectively)  are  known,  then 
the  ROC  curve  that  is  produced  using  the  likelihood  ra¬ 
tio  LR{x)  =  fa(x)/fn(x)  or  any  monotonic  transformation 
of  the  likelihood  ratio  as  the  decision  variable  will  be  the 
“optimal”  ROC  curve  [20,31].  It  will  exhibit  the  best  clas¬ 
sification  performances  that  can  be  achieved  with  the  giv¬ 
en  density  functions.  It  is  often  very  difficult  with  limited 
datasets  to  estimate  the  density  functions  of  the  two  class¬ 
es  of  data;  thus  many  classifiers,  including  those  used  in 
this  paper,  make  no  attempt  to  accurately  estimate  these 
distributions.  The  “optimal”  ROC  curves  that  have  been 
discussed  in  this  work  are  quite  different.  Within  the  limi¬ 
tations  of  the  classifier  employed  and  the  dataset  used  for 
training,  the  ROC  curves  produced  using  the  NP-GA  are 
optimal,  i.e.,  there  is  no  better  ROC  curve  that  can  be 
produced  with  the  same  training  data  and  classifier. 

There  are  sacrifices  that  are  made  when  the  NP-GA  is 
used  for  classifier  optimization.  GAs  are  population-based 
stochastic  optimization  algorithms;  thus,  they  are  typical¬ 
ly  more  time  consuming  than  are  deterministic  algorithms. 
The  time  to  optimize  the  linear  classifier  on  a  400MHz 
Pentium  II  system  was  on  the  order  of  3  minutes.  The 


time  to  optimize  the  ANN  on  this  system  was  about  twen¬ 
ty  minutes.  In  fact,  for  very  complex  systems,  an  NP-GA 
optimization  may  be  impractical  with  current  computer 
technology.  For  ANNs  with  a  large  number  of  inputs  and 
hidden  nodes,  the  NP-GA  may  not  be  suitable  for  train¬ 
ing  with  current  computer  technology  because  of  the  large 
number  of  parameters.  In  these  situations,  techniques  for 
sweeping  out  ANN  ROC  curves  proposed  by  Woods  and 
Bowyer  [23]  may  be  better  suited.  The  NP-GA,  however, 
can  readily  be  made  to  run  in  parallel  which  would  sub¬ 
stantially  decrease  the  execution  time. 

This  paper  has  dealt  with  binary  classifiers.  It  is  often 
important,  however,  to  classify  observations  into  more  than 
two  classes  (benign,  malignant,  and  normal,  for  example). 
For  a  three-class  system,  aggregating  the  multiple  objective 
functions  into  a  single  scalar  function  suffers  from  the  same 
problems  as  the  two-class  problem,  but  to  a  greater  degree. 
Here,  it  is  even  more  difficult  to  adequately  incorporate  the 
class  preferences  in  the  aggregated  objective  function.  The 
ability  of  the  NP-GA  to  circumvent  this  difficulty  is  very 
attractive.  Because  the  non-dominated  set  of  solutions  will 
be  larger,  care  must  be  taken  in  determining  the  NP-GA 
parameter  settings  to  ensure  that  the  Pareto-optimal  set  is 
adequately  sampled. 

Complexity  and  over-training  are  issues  of  great  impor¬ 
tance  in  diagnostic  classifier  research,  and  in  particular  in 
ANN  training  [28,32],  In  practice,  there  is  typically  a  lim¬ 
ited  amount  of  training  data  available,  and  some  sort  of 
regularization  is  imposed  during  the  classifier  training  to 
ensure  that  it  performs  well  on  other  (unknown)  data  sets. 
It  is  well  known  that  large  ANN  weights  correspond  to  com¬ 
plex  separation  functions  [28,32]  that  may  be  indicative  of 
over- training.  To  avoid  this,  we  have  imposed  limit ation- 
s  on  the  magnitudes  of  the  ANN  weights  when  using  the 
NP-GA  to  determine  the  weight  values.  More  systemat¬ 
ic  methods  of  regularizing  the  NP-GA  based  training  may 
be  possible,  however.  One  such  method  is  to  add  a  third 
component  to  the  vector  objective  function  that  measures 
complexity.  In  this  way,  one  can  maximize  the  sensitivi¬ 
ty  and  specificity  while  minimizing  the  complexity  of  the 
classifier.  Depending  on  the  amount  and  quality  of  the 
available  training  data,  a  non-dominated  solution  returned 
by  the  NP-GA  can  be  chosen  such  that  the  classifier  per¬ 
formance  and  generalizability  of  the  result  are  appropriate 
for  the  classification  task.  We  are  currently  investigating 
this  approach  to  classifier  training. 

VI.  Conclusions 

We  have  studied  the  use  of  a  niched  Pareto  genetic  algo¬ 
rithm  in  training  two  popular  diagnostic  classifiers.  Unlike 
conventional  classifier  training  techniques  that  formulate 
the  problem  as  the  solution  to  a  scalar  optimization,  the 
NP-GA  explicitly  addresses  the  multiobjective  nature  of 
the  training  task.  It  has  been  demonstrated  that  the  multi¬ 
objective  approach  removes  the  ambiguity  associated  with 
defining  a  scalar  measure  of  classifier  performance,  and  that 
it  returns  a  set  of  optimal  solutions  that  are  equivalent  in 
the  absence  of  any  information  regarding  the  preference  of 
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the  objectives  (sensitivity,  specificity).  The  performances 
of  these  solutions  can  be  interpreted  as  operating  points 
on  an  optimal  ROC  curve,  describing  the  limiting  trade¬ 
offs  between  sensitivity  and  specificity  that  are  achievable 
by  that  classifier,  given  the  available  training  data.  The 
task  of  classifier  optimization  and  ROC  curve  generation 
are  combined  into  a  single  task.  It  was  demonstrated  that 
constructing  the  ROC  curve  in  this  way  may  result  in  a  bet¬ 
ter  ROC  curve  than  is  produced  by  conventional  methods 
of  ROC  curve  generation.  The  NP-GA  optimization  typi¬ 
cally  requires  more  computation  time  than  do  conventional 
non-stochastic  optimization  methods,  which  may  limit  it- 
s  application  to  certain  problems.  The  advantages  of  the 
NP-GA  approach  to  classifier  training  become  more  pro¬ 
nounced  when  the  number  of  classes  to  be  classified  in¬ 
creases  beyond  two. 

Appendix 

In  this  work,  we  have  investigated  the  use  of  a  multi¬ 
objective  optimization  algorithm  to  train  diagnostic  classi¬ 
fiers  and  generate  ROC  curves.  In  fact,  scalar  optimization 
methods  can  theoretically  arrive  at  the  same  ROC  curves 
as  a  multiobjective  optimization.  Consider  the  following 
scalar  optimization  problem: 

v 

Maximize  ^A  (Al) 

i=l 

where  w  is  an  element  of  the  space  of  possible  parameter 
vectors  W,  Ai  >  0  are  fixed,  and  Y%=i  A i  —  1*  Geoffrion 
[33]  proved  the  following  lemma: 

Lemma  1:  (a)  If  wo  maximizes  Eqn.  Al,  then 
uJ0  is  also  Pareto-optimal  in  the  vector  objective 
space  >  fp(w)\- 

(b)  Let  W  be  a  convex  set,  and  let  the  fi  be 
convex  on  W.  Then  w0  is  Pareto- optimal  if  and 
onlv  if  wo  maximizes  Eqn.  Al  for  some  Ai  >  0  and 

ZL  i. 

Because  the  multiobjective  training  problem  as  we  have 
formulated  it  satisfies  the  convexity  conditions  used  in  the 
Lemma,  it  must  be  true  that  the  optimal  ROC  operating 
points  can  be  obtained  by  performing  multiple  scalar  opti¬ 
mizations  with  varying  Ai’s. 

It  is  clear  from  Fig.  4  that  the  solutions  returned  by 
the  NP-GA  are  Pareto-optimal  because  for  this  problem, 
we  can  plot  the  performances  of  all  possible  solutions  (the 
shaded  region  in  Fig.  4).  However,  in  Fig.  7,  we  cannot 
plot  the  performances  of  all  possible  solutions  due  to  the 
large  dimensionality  of  the  parameter  space.  We  can,  how¬ 
ever,  make  a  comparison  between  the  solutions  returned  by 
the  NP-GA  and  the  solutions  returned  by  multiple  scalar 
optimizations  which  maximize 

A Sens(w)  +  (1  -  A )Spec(w)  (A2) 

with  A  varying  between  0  and  1.  We  implemented  a  scalar 
GA  using  the  same  GA  parameters  and  parameter  restric¬ 
tions  as  imposed  on  the  NP-GA  to  optimize  Eqn.  A2.  As 


Fig.  Al.  A  comparison  of  the  solutions  returned  by  the  NP-GA 
and  the  solutions  returned  by  20  scalar  optimizations  employing 
a  weighted  sum  of  sensitivity  and  specificity  as  the  scalar  cost 
function.  The  two  methods  returned  many  similar  solutions,  but 
the  solutions  returned  from  multiple  scalar  optimizations  tended 
to  clump  together  in  certain  areas  whereas  the  NP-GA  solutions 
were  uniformly  distributed  in  ROC  space.  Note  that  only  18  of 
the  20  scalar  solutions  were  distinct. 

described  above,  the  solutions  to  both  of  these  problem- 
s  should  be  Pareto-optimal  in  ROC  space  assuming  the 
optimizations  are  complete.  Figure  Al  compares  the  NP- 
GA  solutions  and  the  solution  achieved  through  multiple 
runs  of  a  scalar  optimization  with  varying  A.  The  points 
returned  by  the  multiple  scalar  optimizations  are  similar 
to  certain  points  returned  by  the  NP-GA.  Note  that  the 
multiple  scalar  optimized  solutions  are  clumped  together 
in  certain  areas  of  the  ROC  space.  It  is  unknown,  a-priori , 
how  to  vary  A  to  evenly  sample  the  Pareto-front,  whereas 
the  NP-GA  employs  niching  to  ensure  an  even  sampling 
of  the  Pareto-front  or  optimal  ROC  curve.  One  also  can¬ 
not  employ  gradient-based  techniques  to  optimize  discrete 
performance  measures  such  as  sensitivity  and  specificity. 
Because  of  this,  it  was  necessary  to  perform  20  separate 
stochastic  scalar  optimizations  to  get  the  20  ROC  operat¬ 
ing  points.  On  the  other  hand,  a  more  complete  sampling 
of  the  ROC  curve  was  obtained  by  a  single  run  of  the  NP- 
GA,  which  required  approximately  the  same  CPU  time  as 
one  run  of  the  scalar  optimizer.  So  despite  the  theoretical 
equivalence  of  the  two  methods,  there  are  practical  advan¬ 
tages  to  performing  a  single  multiobjective  optimization 
over  multiple  scalar  optimizations. 
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Abstract 

We  have  investigated  various  methods  of  feature  se¬ 
lection  for  two  different  data  classifiers  used  in  the 
computerized  detection  of  mass  lesions  in  digital  mam¬ 
mograms.  Numerous  features  were  extracted  from  ab¬ 
normal  and  normal  breast  regions  from  a  database 
consisting  of  210  individual  mammograms.  A  step¬ 
wise  method,  a  genetic  algorithm  and  individual  fea¬ 
ture  analysis  were  employed  to  select  a  subset  of  fea¬ 
tures  to  be  used  with  linear  discriminants.  Similar  tech¬ 
niques  were  also  employed  for  an  artificial  neural  net¬ 
work  classifier.  In  both  tests  the  genetic  algorithm  was 
able  to  either  outperform  or  equal  the  performance  of 
other  methods. 


1.  Introduction 

Computer-aided  diagnosis  in  digital  mammography 
is  a  topic  that  has  received  much  attention  recently [4, 
10,  13, 16]  due  to  the  potential  benefits  of  double  read¬ 
ing  in  mammography.  [9,  12]  Many  computerized  mass 
detection  schemes  employ  a  classifier,  such  as  a  neu¬ 
ral  network,  to  distinguish  between  lesions  and  false- 
positives.[5,  6, 14]  At  the  University  of  Chicago,  a  com¬ 
puterized  scheme  is  being  developed  in  which  features 
are  extracted  from  potential  lesion  sites  and  merged 
into  a  single  decision  variable  using  a  classifier.  Nu¬ 
merous  features  can  be  extracted  from  potential  lesions 
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sites [5]  making  it  difficult  to  optimally  choose  represen¬ 
tative  features  to  be  used  as  inputs  to  a  classifier.  In 
this  paper  we  will  undertake  the  problem  of  feature 
selection  for  two  different  classifiers  using  a  data  set 
consisting  of  features  extracted  from  lesions  and  false- 
positive  detections. 

2.  Materials  and  Methods 

The  database  employed  in  this  study  consists  of  210 
(53  cases)  individual  mammograms  with  111  visible  le¬ 
sions.  All  but  one  of  the  53  cases  contains  the  four 
standard  mammographic  views.  Images  were  scanned 
on  a  Konica  film  digitizer  to  a  matrix  size  of  512  by 
512  pixels  with  10-bit  quantization.  All  lesions  in  this 
database  were  biopsy  confirmed.  302  false-positive  re¬ 
gions  were  selected  from  the  database  to  be  used  in 
this  study  along  with  the  true-positive  regions.  A  total 
of  71  features  are  extracted  from  each  lesion  and  false¬ 
positive.  A  full  discussion  of  the  methods  we  utilized  to 
locate  and  extract  the  features  from  the  regions  of  in¬ 
terest  (ROIs)  can  be  found  in  our  previously  published 
papers. [1,  5,  15] 

Linear  discriminants,  namely  Fisher  discriminants, 
were  employed  as  the  initial  classifier.  Three  methods 
of  feature  selection  were  tested  for  linear  discriminants. 
The  first  was  to  select  those  features  that  exhibited  the 
greatest  individual  separation.  This  method  of  feature 
selection  is  a  rough  first-approximation  of  an  optimal 
subset  of  features  to  be  input  to  a  linear  classifier. 
Inter-feature  correlations  are  not  taken  into  account. 
Also,  it  does  not  provide  any  means  of  selecting  the 


number  of  features  to  be  used  as  inputs  to  the  linear 
discriminant. 

A  second  method  employed  to  select  a  subset  of 
features  for  a  linear  classifier  was  the  stepwise  selec¬ 
tion  method. [3]  The  performance  of  a  set  of  features 
is  characterized  by  a  parameter  known  as  the  Wilks’ 
lambda  which  is  the  proportion  of  the  total  variance 
attributed  to  within-group  variations  in  the  final  deci¬ 
sion  variable. [3]  If  the  Wilks’  lambda  is  0  then  all  of 
the  variance  is  due  to  between-groups  variations  so  the 
means  of  the  two  classes  are  well  separated.  Conversely, 
if  the  Wilks’  lambda  is  1  then  all  of  the  variance  is  due 
to  within-group  variations  and  one  can  conclude  that 
the  means  of  the  two  classes  in  the  final  decision  vari¬ 
able  are  equal.  In  the  stepwise  method,  the  first  feature 
is  selected  based  on  the  Wilks’  lambda  of  each  individ¬ 
ual  feature.  Successive  features  are  selected  based  on 
the  improvement  of  the  Wilks’  lambda.  After  a  fea¬ 
ture  is  added,  all  features  are  tested  for  removal.  This 
continues  until  the  statistical  significance  of  adding  or 
removing  a  feature  is  small.  The  advantages  of  this 
method  is  that  it  implicitly  takes  the  correlation  of 
features  into  account  and  also  selects  the  number  of 
features  to  be  used  as  inputs. 

The  third  method  employed  to  select  an  optimal 
subset  of  features  for  a  linear  classifier  was  a  ge¬ 
netic  algorithm. [2,  11]  A  genetic  algorithm  (GA)  is  a 
stochastic- based  search  method  based  on  the  princi¬ 
ples  of  evolution  in  nature.  The  fitness  function  we 
employed  was  the  Wilks’  lambda.  Runs  of  700  gener¬ 
ations  were  made  with  a  0.5%  probability  of  mutation 
and  a  70%  probability  of  crossover.  Figure  1  shows 
the  typical  performance  of  the  genetic  algorithm.  The 
set  of  features  with  the  best  performance  at  the  end  of 
the  GA  run  was  used  as  the  selected  input  feature  set. 
All  feature  sets  resulting  from  the  different  selection 
methods  were  evaluated  using  ROC  analysis  and  the 
performance,  was  characterized  by  the  area,  Az,  under 
the  ROC  curve. [7,  8] 

A  three-layered  back-propagation  neural  network 
was  also  studied  as  a  classifier.  ROC  analysis  was  per¬ 
formed  on  both  the  consistency  and  cross  validation 
results.  Features  were  selected  as  inputs  to  the  ar¬ 
tificial  neural  network  using  three  methods  similar  to 
those  used  for  selecting  features  for  the  linear  classifier. 
The  features  that  exhibited  the  greatest  individual  sep¬ 
aration  were  used  in  a  neural  network.  In  the  second 
method,  a  forward  selection  method  was  used.  The 
utility  function  was  the  Az  from  a  consistency  test  of 
a  simplified  ANN  structure.  Forward  selection  begins 
with  the  one  feature  which  has  the  best  individual  per¬ 
formance  and  tests  all  possible  combinations  of  that 
feature  with  another.  This  process  continues  until  the 


Figure  1.  Atypical  genetic  algorithm  run.  The 
shaded  area  represents  the  variation  in  pop¬ 
ulation  performance  (mean  ±  one  standard 
deviation)  at  each  generation. 


number  of  features  desired  has  been  selected.  There 
is  no  feature  elimination  stage  in  the  forward  selection 
method  as  there  is  in  the  stepwise  method.  The  final 
results  from  the  forward  selection  method  are  input 
to  an  ANN  with  2  hidden  units  and  both  consistency 
and  cross  validation  results  are  evaluated.  A  genetic 
algorithm  was  also  studied  with  the  Az  from  the  con¬ 
sistency  test  of  a  simplified  ANN  structure  used  as  the 
fitness  value.  A  simplified  structure  should  reduce  the 
effects  of  over-fitting  which  often  plague  artificial  neu¬ 
ral  networks.  This  final  set  of  features  resulting  from 
the  genetic  algorithm  were  then  input  to  a  more  com¬ 
plex  ANN  structure  (2  hidden  units)  where  both  con¬ 
sistency  and  cross  validation  tests  were  employed. 

3.  Results 

Table  1  shows  the  Az  values  for  the  feature  selec¬ 
tion  methods  used  for  determining  the  inputs  for  a  lin¬ 
ear  discriminant.  Wilks’  lambdas  are  also  shown.  It 
is  clear  from  the  table  that  selecting  features  based  on 
their  individual  performance  is  inadequate.  In  Figure  2 
the  three  different  feature  selection  methods  are  com¬ 
pared  using  the  ROC  curves  when  9  features  are  se¬ 
lected  by  each  method.  The  Az  values  for  the  feature 
sets  selected  by  the  genetic  algorithm  and  the  step¬ 
wise  method  are  statistically  significantly  (p  <  0.05) 
better  than  that  of  the  single  feature  analysis  method. 
The  genetic  algorithm  shows  a  slight  advantage  over 
the  stepwise  selection  method  but  it  is  not  statistically 


Table  1.  Summary  of  results  from  the  feature 
selection  methods  for  linear  discriminants. 


Method 

Az 

Wilks’ 

Lambda 

Number  of 
Features 

0.93 

0.53 

9 

Single  Feature 

0.92 

0.53 

10 

Analysis 

0.93 

0.51 

11 

0.94 

0.50 

12 

Stepwise 

0.94 

0.47 

9 

0.95 

0.47 

9 

Genetic 

0.95 

0.47 

10 

Algorithm 

0.95 

0.46 

11 

0.95 

0.46 

12 

Table  2.  Summary  of  results  from  the  feature 
selection  methods  for  artificial  neural  net¬ 
works. 


Method 

Cross 

Validation  Az 

Number  of 
Features 

Single  Feature 

0.96 

11 

Analysis 

Forward  Selection 

0.97 

11 

Genetic  Algorithm 

0.98 

10 

significant  (p  =  0.23). 

Table  2  shows  preliminary  results  from  the  ANN 
feature  selection  methods.  It  should  be  noted  that 
multiple  genetic  algorithm  runs  were  required  meaning 
that  the  genetic  algorithm  did  have  trouble  with  local 
maxima.  This  might  suggest  that  the  probability  of 
mutation  be  increased,  as  well  as  the  population  size, 
to  allow  for  more  diversity  throughout  the  runs.  As 
the  table  shows  the  set  of  features  selected  by  the  ge¬ 
netic  algorithm  was  able  to  outperform  the  other  two 
methods  but  the  results  were  not  statistically  signifi¬ 
cant  (p  =  0.06  for  the  individual  analysis  selector  and 
p  -  0.15  for  the  forward  selector).  The  corresponding 
ROC  curves  are  shown  in  Figure  3. 

4.  Discussion 

The  purpose  of  this  paper  has  been  to  introduce  fea¬ 
ture  selection  methods  and  compare  their  utility  with 
two  different  classifiers.  The  results  from  the  linear 
discriminant  analysis  show  that  the  genetic  algorithm 
feature  selection  method  is  as  good  if  not  better  than 
the  stepwise  method.  Similar  results  were  obtained  for 
the  artificial  neural  network  classifiers  but  the  results 


Figure  2.  ROC  curves  for  the  three  different 
linear  discriminant  features  selection  meth¬ 
ods  when  9  features  were  selected  by  each. 


Figure  3.  Cross  validation  ROC  curves  for 
ANN  feature  selectors. 


were  not  as  strong.  As  with  all  studies  employing  neu¬ 
ral  networks,  it  is  possible  that  there  is  over-fitting  of 
the  data.  We  attempted  to  minimize  this  effect  by  sim¬ 
plifying  the  structure  of  our  networks  and  by  employing 
cross  validation  or  leave-one-out  tests.  Future  work  will 
include  investigations  performed  on  larger  data  sets. 
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Abstract 

In  this  work,  we  study  and  evaluate  the  ability  of  two  classification  methods  (Bayesian 
artificial  neural  networks  (Bayesian  ANNs)  and  rule-based  classifiers  trained  using  a  mul¬ 
tiobjective  approach)  to  distinguish  between  malignant  masses  and  false  candidates  in 
the  computerized  detection  of  mass  lesions  in  mammography.  A  Bayesian  ANN  is  known 
to  accurately  approximate  the  ideal  observer  but  may  require  few  input  features  due  to 
training  dataset  limitations.  The  multiobjective  approach  to  classifier  training  returns  the 
best  possible  ROC  curve  that  a  given  classifier  can  achieve  on  a  given  training  dataset. 
Because  of  overtraining  issues,  the  multiobjective  approach  is  typically  used  to  optimize 
simple  classifiers  (i.  e.,  classifiers  with  relatively  few  parameters)  and,  thus,  one  can  employ 
more  input  features  with  these  classifiers.  Both  Bayesian  ANNs  and  a  simple  rule-based 
classifier  trained  using  the  multiobjective  approach  performed  well  in  the  task  of  distin¬ 
guishing  between  mass  lesions  and  false  candidates  in  mammography.  A  Bayesian  ANN 
with  5  features  outperformed  the  rule-based  classifier  which  used  a  total  of  eight  features. 
However,  the  rule-based  classifier  employed  only  8  classifier  parameters  instead  of  the 
Bayesian  ANN’s  71  parameters  (or  ANN  weights). 
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I.  Introduction 

Breast  cancer  is  the  most  common  malignancy  in  women  and  the  second  most  common 
cause  of  death  from  malignancy  in  this  patient  population.  In  the  United  States,  more  than 
180,000  women  develop  the  disease  each  year  [1]  and  women  who  live  to  an  advanced  age 
have  a  greater  than  one  in  nine  chance  of  developing  breast  cancer  during  their  lifetimes 
[2],  The  disease,  therefore,  represents  a  major  public  health  problem. 

Mammography,  x-ray  imaging  of  the  breast,  is  currently  the  best  method  for  the  early 
detection  of  breast  cancer.  Between  10  and  30%  of  women  who  have  breast  cancer  and 
undergo  mammography  have  negative  mammograms,  however  [3-8].  In  approximately 
two-thirds  of  these  false-negative  mammograms,  the  radiologist  failed  to  detect  a  cancer 
that  was  evident  retrospectively  [6,7,9,10].  The  missed  detections  may  be  due  to  the  subtle 
nature  of  the  radiographic  findings  (i.e.,  low  conspicuity  of  the  lesion),  poor  image  quality, 
eye  fatigue,  or  oversight  by  the  radiologists.  It  has  been  suggested  that  double  reading  (by 
two  radiologists)  may  increase  sensitivity  [11-15].  Thus,  one  aim  of  CAD  is  to  increase 
the  efficiency  and  effectiveness  of  screening  procedures  by  using  a  computer  system,  as  a 
“second  reader”  (like  a  “spell  checker”),  to  indicate  locations  of  suspicious  abnormalities 
in  mammograms  as  an  aid  to  the  radiologist  leaving  the  final  decision  regarding  the  like¬ 
lihood  of  the  presence  of  a  cancer  and  patient  management  to  the  radiologist  [16].  The 
interpretation  of  screening  mammograms  lends  itself  to  CAD  since  it  is  a  repetitive  task 
involving  mostly  normal  images. 

Pattern  classification  has  an  important  role  in  computerized  detection/diagnosis  schemes 
in  medical  imaging.  Thus,  the  ability  to  accurately  and  robustly  classify  suspicious  image 
regions  is  vital.  We  previously  presented  a  method  of  classification  that  can  accurately 
approximate  the  ideal  observer  given  a  large  enough  training  dataset  [17],  investigated 
the  effect  of  limited  datasets  on  feature  selection  [18],  and  also  introduced  a  method  of 
optimally  designing  and  evaluating  simple  classifiers  to  be  used  when  the  training  dataset 
is  not  large  [19].  In  this  paper,  we  will  use  the  knowledge  gained  from  the  research  per¬ 
formed  on  feature  selection,  Bayesian  ANNs,  and  the  multiobjective  approach  to  classifier 
training  to  design  classifiers  for  a  computerized  mass  detection  method.  Two  types  of  clas¬ 
sifiers  were  designed  (a  rule-based  classifier  and  an  ANN  classifier)  and  evaluated  using 
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ANN  Classifier  Rule-Based  Classifier 


Fig.  1.  An  overview  of  the  classification  and  feature  selection  methods  evaluated  for  use  in  a  computerized 

mass  detection  method. 

various  subsets  of  features.  Figure  1  shows  the  various  feature  selection  methods,  classi¬ 
fiers,  training  methods,  and  methods  of  evaluation  used  here.  This  paper  begins  with  an 
introduction  to  the  mass  detection  method  currently  being  developed  at  The  University 
of  Chicago.  Section  II  discusses  the  databases  used  to  determine  the  classifier  parameters 
and  to  validate  the  classifiers.  Section  IV  describes  the  features  selected  for  the  Bayesian 
ANN  and  the  application  of  the  Bayesian  ANN  to  the  mass  detection  method  (the  left  side 
of  Fig.  1).  Section  V  describes  the  classifier  implemented  and  the  features  selected  for  use 
in  the  multiobjective  approach  to  classifier  training  for  distinguishing  between  malignant 
lesions  and  false-positive  candidates  (the  right  side  of  Fig.  1).  Finally,  Sections  VI  and  VII 
summarize  the  results  and  gives  the  overall  performance  of  the  mass  detection  method. 

II.  Database 

A  database  of  177  screening  mammography  cases  containing  at  least  one  malignant 
mass  lesion  and  75  cases  not  containing  a  mass  were  employed  in  this  study  for  a  total  of 
252  cases.  There  were  a  total  of  181  malignant  mass  lesions  which  corresponded  to  333 
radiographically  visible  mass  lesions  on  the  864  digitized  films  (most,  but  not  all,  of  the  252 
cases  had  4  films  available  per  case).  The  images  were  digitized  on  either  a  Lumisys  100 
or  a  Lumisys  85  digitizer  to  100  /im  pixel  size  and  12-bit  gray-level  quantization.  Different 
screen-film  systems  as  well  as  different  exposure  conditions  were  used  for  the  images  in  this 
database.  The  only  image  correction  made  was  to  ensure  that  the  relationship  between 
pixel  values  and  optical  density  was  approximately  the  same  for  each  image.  Figure  2 
shows  the  characteristic  curves  of  the  two  digitizers  both  before  and  after  the  correction. 
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(a)  (b) 

Fig.  2.  The  original  characteristic  curves  of  the  two  digitizers  (Lumisys  100  and  Lumisys  85)  are  shown 
in  (a).  Images  digitized  on  the  two  Lumisys  digitizers  were  scaled  so  that  the  effective  characteristic 
curves  are  similar  as  is  shown  in  (b).  Note  that  the  scale  on  the  two  graphs  are  different;  the  scaled 
images  have  effectively  10  bits  per  pixel. 

As  we  will  discuss  in  the  next  section,  however,  many  of  the  features  employed  were  either 
geometric  based  and,  hence,  did  not  depend  on  the  pixel  values,  or  were  gradient  based 
features  which  only  depend  on  the  ratio  of  pixel  values  and  not  the  raw  values  themselves. 

Figure  3  shows  the  distribution  of  sizes  and  contrast  values  of  the  visible  mass  lesions 
as  measured  by  a  radiologist's  outlined  truth.  The  contrast  was  measured  by  taking  the 
average  within-lesion  pixel  value  and  subtracting  it  from  the  average  pixel  value  in  a  region 
outside  the  lesion  [20].  For  these  contrast  calculations,  the  region  outside  the  lesion  was 
defined  by  pixels  not  inside  the  lesion  but  within  a  bounding  box  around  the  lesion  that 
was  extended  3  mm  (or  10  pixels)  in  all  four  directions  making  this  analogous  to  the  “small 
window”  case  in  reference  [20].  Many  of  the  lesions  in  the  database  are  less  than  1.5  cm 
in  effective  diameter  and  have  a  contrast  less  than  0.25.  Thus,  a  substantial  fraction  of 
the  lesions  in  this  database  are  either  small,  low  contrast,  or  both. 

A  total  of  235  true-positive  candidates  were  both  detected  by  the  RGI  filtering  method 
(to  be  discussed  later)  and  segmented  with  an  overlap  fraction  greater  than  0.2  when 
compared  to  the  radiologist’s  outlined  truth.  This  corresponds  to  150  mass  cases  being 
detected  out  of  the  177  total  mass  cases.  There  also  were  over  11,000  false  candidates 
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(a)  (b) 

Fig.  3.  The  distribution  of  visible  mass  lesion  (a)  sizes  and  (b)  contrast  values  as  calculated  using  the 
radiologist’s  outlines  of  each  visible  lesion. 

returned  by  the  initial  detection  algorithm.  To  ensure  an  independent  evaluation  of  the 
mass  detection  method,  the  database  was  randomly  split  by-case  into  a  training  dataset 
and  a  validation  dataset.  The  training  dataset  consisted  of  126  cases  with  121  visible  mass 
lesions  (or  76  mass  cases)  that  were  both  detected  by  the  initial  detection  algorithm  and 
segmented  with  an  overlap  greater  than  0.2.  The  RGI  filtering  algorithm  also  returned 
5472  false  candidates  for  the  images  in  the  training  dataset  from  which  a  subset  of  242  false 
candidates  was  randomly  selected  for  classifier  training.  The  validation  dataset  consisted 
of  the  remaining  126  cases  with  114  visible  mass  lesions  (or  74  mass  cases)  that  were  both 
detected  by  the  initial  detection  algorithm  and  segmented  with  an  overlap  greater  than 
0.2  as  well  as  5685  false  candidates. 

III.  Lesion  Detection  and  Feature  Extraction 

Numerous  research  groups  have  developed  computerized  mass  detection  methods  which 
use  various  classification  techniques  such  as  neural  networks  [21—24],  linear  discriminant 
analysis  [23,25],  and  classification  trees  [26]  as  well  as  various  features  such  as  spiculation 
[27-29],  shape  [30,31],  and  texture  [32,33].  Rationale  and  details  of  these  techniques  and 
others  can  be  found  in  various  papers  and  chapters  [16,34-38]  and  in  proceedings  of  the 
International  Workshops  on  Digital  Mammography  [39-41]  or  the  International  Workshop 
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Phase  1 


Phase  2 


Fig.  4.  Overview  of  a  computerized  mass  detection  scheme. 

on  Computer-Aided  Diagnosis  [42],  Many  computerized  mass  detection  methods  use  a 
similar  two-phase  detection  approach.  First,  candidate  lesions  are  located  and  then  each 
candidate  is  further  analyzed  in  a  feature  analysis  and  classification  phase  to  determine 
the  final  classification  of  each  candidate.  The  mass  detection  method  being  developed  at 
The  University  of  Chicago  follows  this  pattern  as  is  shown  in  Fig.  4.  Each  stage  in  Fig.  4 
will  now  be  briefly  discussed. 

A.  Breast  Segmentation 

In  order  to  limit  the  region  of  search  for  lesion  detection,  the  breast  region  is  initially 
segmented  from  the  image.  Computer-defined  unexposed  and  direct-exposure  image  re¬ 
gions  are  used  to  generate  a  border  around  the  breast  region  [31].  Once  the  breast  region 
is  known,  preprocessing  and  search  are  limited  by  the  breast  border. 

B.  Initial  Detection 

The  computerized  mass  detection  method  being  developed  at  The  University  of  Chicago 
is  a  two-phase  detection  method*  the  first  detection  phase  locates  suspicious  areas  called 
lesion  candidates.  The  methods  used  in  this  phase  are  typically  highly  sensitive  but  return 
an  unacceptable  number  false-positive  candidates  to  be  used  alone.  We  have  developed  a 
non-linear  filtering  technique  called  radial-gradient  index  (RGI)  filtering  [43]  to  be  used  as 
the  initial  detection  algorithm.  RGI  filtering  takes  as  input  the  original  mammogram  and 
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the  breast  segmentation  results  returned  by  the  breast  segmentation  stage  and  outputs  a 
filtered  image  representing  the  confidence  that  a  lesion  is  present  at  each  location  within 
the  image. 

The  overall  performance  of  the  RGI  filtering  technique  is  shown  in  Fig.  5  with  the 
implicit  FROC  decision  variable  being  the  RGI  threshold  of  the  filtered  image  threshold. 
Because  images  are  being  thresholded  to  generate  these  FROC  curves,  they  exhibit  the 
unusual  behavior  of  starting  and  ending  near  the  (0, 0)  point  in  FROC  space.  This  is 
because  at  a  very  high  RGI  threshold,  no  pixels  pass  the  threshold,  and  one  achieves  0 
true  candidates  and  0  false  candidates.  Also,  at  a  very  low  threshold,  all  pixel  pass  and 
one  is  left  with  a  single  very  large  detection.  Thus,  at  very  low  thresholds,  one  achieves 
a  sensitivity  near  0  and  nearly  1  false  candidate  per  image.  As  is  shown  in  Fig.  5,  this 
technique  alone  can  achieve  a  sensitivity  of  93%  with  16  false  candidates  per  image  at 
an  RGI  threshold  of  0.74.  A  lesion  was  considered  “detected”  for  this  study  if  the  center 
of  mass  of  a  connected  region  in  the  thresholded  image  was  within  a  radiologist’s  outline 
of  the  lesion.  The  regions  returned  by  this  method  represent  the  candidate  lesions  to  be 
further  analyzed  by  the  subsequent  feature  extraction  and  pattern  classification  stages  of 
the  computerized  mass  detection  method. 

C.  Lesion  Segmentation 

In  order  to  automatically  extract  features  from  each  candidate  lesion,  the  potential 
abnormality  needs  to  be  segmented  from  the  breast  parenchyma  background  using  as 
input  a  given  seed  location  as  returned  by  the  previous  initial  detection  stage.  We  have 
developed  and  investigated  methods  of  lesion  segmentation  that  involve  multiplication  of 
the  suspect  location  (given  the  seed  point)  with  a  constraint  function  such  as  a  Gaussian 
and  generating  a  series  of  lesion-like  contours  by  performing  local  thresholding  on  this 
“constrained”  image  [44].  The  final  contour  is  identified  by  means  of  either  an  RGI-based 
or  a  probabilistic  method  as  described  elsewhere  in  the  literature  [44].  The  RGI-based  and 
probabilistic  methods  outperformed  a  conventional  region  growing  method  when  compared 
with  the  radiologist’s  outlined  truth.  The  probabilistic  segmentation  method  tends  to  be 
time  consuming,  so  we  implemented  the  RGI-based  segmentation  in  the  mass  detection 
method. 
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False  Detections  per  Image 


Fig.  5.  RGI  filtering  FROC  curves  for  various  minimum  size  cutoffs  using  the  RGI  threshold  as  the 
decision  variable.  Note  that  there  is  a  large  decrease  in  the  false-detection  rate  when  the  minimum 
size  cutoff  is  increased  from  0  to  1  without  a  large  decrease  in  the  sensitivity. 


-  Candidate  Lesion  £ 

. Effective  Circle  C 


Fig.  6.  The  sets  used  to  define  the  geometric-based  features. 


D.  Feature  Extraction 

The  image  data  and  the  contour  returned  by  the  lesion  segmentation  stage  are  em¬ 
ployed  to  extract  features  for  each  candidate  lesion  within  the  image.  These  features  are 
subsequently  used  in  the  pattern  classifier  to  determine  the  final  classification  of  each 
candidate  lesion.  Radiographically,  mass  lesions  can  be  characterized  by  their  degree  of 
spiculation,  margin  definition,  shape,  density,  homogeneity  (texture),  asymmetry,  and  so 
forth.  Descriptors  of  these  characteristics  may  be  also  grouped  as  gradient-based  features, 
intensity-based  features,  and  geometric  features. 
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D.l  Geometric  Features 

Geometric  features  use  the  results  of  the  lesion  segmentation  algorithm  to  arrive  at 
measures  of  the  size,  circularity,  irregularity,  and  compactness  [45]  of  each  candidate  lesion. 
The  geometric  features  are  defined  as: 

Perim(C) 

Irregularity  =  1  - 

Area(£  fl  C ) 

Circularity  =  Area W~ 

Area(£) 

Compactness  =  4ir  ppriTT1^y , 

where  Perim(-)  is  the  perimeter  operator,  Area(')  is  the  area  operator,  £  is  the  set  of  lesion 
pixels,  and  C  is  the  effective  circle  set  which  is  centered  at  the  candidate  lesion’s  center  of 
mass  and  has  an  area  equal  to  that  of  the  candidate  lesion  as  illustrated  in  Fig.  6. 


D.2  Gray-level  Features 


The  gray-level  features  use  both  the  pixel- value  information  (i.e.,  the  image  function 
f(x,y))  as  well  as  the  lesion  segmentation  results  (i.e.,  £)  to  compute  the  average  gray 
level  within  the  lesion,  the  standard  deviation  of  the  gray  levels  within  the  lesion,  the 
internal  contrast 


IC  = 


max  f{x,y) 
(x,y)ec _ 

min  f(x,y)' 

(x,y)€C 


(i) 


and  the  external  contrast 

ec=2_ (Avgl-AvgEl'  (2) 

Avgl  +  AvgE 

where  Avgl  is  the  average  pixel  value  within  the  lesion  £,  and  AvgE  is  the  average 
pixel  value  within  a  periphery  region  outside  of  the  lesion.  To  determine  the  periphery 
neighborhood,  a  bounding  box  for  each  candidate  lesion  was  extended  by  3mm  on  all  sides 
and  a  smoothed  version  [46]  of  the  candidate  lesion  contour  was  employed  to  determine 
which  image  pixels  to  exclude  (see  Fig.  7  for  an  example  periphery  region). 


D.3  Gradient  Features 

Gradient-based  features  are  measures  such  as  the  average  and  standard  deviation  of  the 
gradient  strengths  within  the  four  neighborhoods  [46]  (margin,  grown  region,  region-of- 
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Fig.  7.  The  four  neighborhoods  used  to  calculate  gradient-based  features. 

interest  ROI,  and  periphery)  as  shown  in  Fig.  7,  the  RGI  value,  and  numerous  gradient- 
weighted  histogram  (GWH)  features  [46].  To  compute  the  GWH  features,  one  first  com¬ 
putes  the  gradient  vectors  for  each  pixel  within  the  neighborhood  of  interest.  Then  a 
histogram  is  generated  of  the  either  the  Cartesian  angles  of  the  gradients  (Cartesian 
gradient-weighted  histograms  or  CGWHs)  or  the  angles  of  the  gradients  relative  to  the 
radial  direction  (radial  gradient-weighted  histograms  or  RGWHs).  Each  entry  that  is  ac¬ 
cumulated  into  these  histograms  is  weighted  by  the  magnitude  of  the  gradients  at  each 
angle  [46].  Kernel  density  estimation  using  a  Gaussian  kernel  with  the  width  determined 
optimally  via  cross-validation  is  used  to  achieve  a  continuous  estimate  of  the  histogram 
function.  Finally,  measures  such  as  the  full  width  at  half  maximum,  the  minimum  his¬ 
togram  value,  and  the  height  are  computed  as  illustrated  in  Fig.  8.  GWH  analysis  is 
performed  on  the  four  neighborhoods  (Fig.  7)  for  each  candidate  lesion  and  in  both  the 
Cartesian  (CGWH)  and  radial  directions  (RGWH).  GWH  features  can  characterize  margin 
sharpness,  spiculation,  linearity,  as  well  as  other  properties  of  candidate  lesions. 

In  total,  40  features  (31  gradient-based,  4  intensity-based,  and  5  geometric)  are  extracted 
from  each  candidate  lesion  site.  A  subset  of  these  features  must  be  used  in  the  final 
candidate  lesion  classification  stage.  In  general,  use  of  all  40  features  would  require  a 
much  larger  training  database  and  would  severely  limit  the  robustness  of  the  detection 
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Fig.  8.  An  illustration  of  the  features  extracted  from  the  gradient  weighted  histograms. 

method  (ie.,  the  potential  for  overtraining  would  be  large).  It  should  be  noted  that 
although  the  pixel  size  of  the  digitized  mammograms  was  100  fim,  subsampled  images 
with  effective  pixel  sizes  of  500  /ma  were  employed  in  the  initial  lesion  detection  and  in  the 
calculation  of  the  feature  values. 

E.  Classification 

The  task  in  this  stage  is  to  use  some  of  the  features  described  in  Section  III-D  to  classify 
the  candidate  lesion  sites  as  either  malignant  lesions  or  false  candidates.  We  have  imple¬ 
mented  a  Bayesian  ANN  and  a  rule-based  classifier  that  we  train  using  a  multiobjective 
approach  to  classify  candidate  lesions.  These  methods  will  be  discussed  in  detail  in  the 
next  sections. 

IV.  Application  of  the  Bayesian  ANN 

A.  Feature  Selection 

We  previously  demonstrated  [18]  the  bias  that  is  introduced  if  one  uses  the  same  dataset 
for  both  selecting  features  and  determining  the  parameters  of  a  classifier.  If  the  validation 
dataset  is  not  employed  in  the  feature  selection  process,  then  the  performance  of  the  clas¬ 
sifier  on  the  independent  validation  dataset  will  be  depressed.  Thus,  if  the  performance  on 
the  validation  dataset  of  the  classifier  using  features  selected  (using  the  training  dataset) 
by  an  automated  feature  selection  algorithm  is  better  (and  the  results  are  statistically 
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significant)  than  that  of  hand-selected  features,  then  the  features  selected  using  the  auto¬ 
mated  technique  are  “better”  than  those  selected  by-hand.  It  is  important  to  keep  small 
the  number  of  performance  comparisons  measured  on  the  validation  dataset  in  order  to 
reduce  random  effects. 

We  know  from  our  prior  simulation  studies  [17]  that  in  order  to  avoid  overtraining,  we 
can  use  a  maximum  of  five  features  in  the  Bayesian  ANN  based  on  our  training  dataset  size. 
Interestingly,  we  analyzed  the  performance  of  the  Bayesian  ANN  on  the  training  data  using 
various  subsets  of  six  features  and  found  that  it  was  susceptible  to  overtraining  as  indicated 
by  the  the  fact  that  the  performance  of  the  Bayesian  ANN  was  not  stable  with  increasing 
hidden  units.  Thus,  we  define  the  feature  selection  task  as  one  of  selecting  the  five  “best” 
features  to  be  used  as  inputs  for  the  Bayesian  ANN.  Features  that  based  upon  similar 
lesion  characteristics  were  employed  because  they  measure  the  characteristics  in  different 
fashions.  For  example,  two  spiculation  features  are,  generally,  somewhat  correlated  but 
each  also  contains  information  that  the  other  feature  does  not  extract  and,  thus,  both 
features  may  be  useful. 

Malignant  mass  lesions  tend  to  exhibit  spiculation,  they  are  more  circularly  shaped 
than  a  typical  false  candidate,  they  have  a  higher  density,  the  pixel  values  tend  to  be 
less  variable  (*.e.,  smoother  texture),  and  the  margins  of  lesions  are  more  well-defined 
than  false  candidates  [47].  Based  on  this  information  and  the  individual  performances 
of  the  various  features  in  the  database  as  measured  on  the  training  dataset,  the  features 
characterized  in  Table  I  were  hand-selected  to  be  used  in  the  Bayesian  ANN.  The  texture 
feature  was  not  included  in  the  hand-selected  feature  set  because  it  performed  poorly  by 
itself  in  the  task  of  distinguishing  between  mass  lesions  and  false  candidates. 

We  also  implemented  two  automated  feature  selection  methods  to  select  features  to  be 
used  in  the  Bayesian  ANN  (Fig.  1).  The  first  was  a  forward  selection  method  [48]  in 
which  the  features  that  improved  the  performance  the  most  were  added  in  succession.  For 
example,  if  a  contrast  measure  alone  has  the  best  individual  Bayesian  ANN  performance, 
then  contrast  is  added  to  the  list  of  selected  features.  Then,  every  other  feature  is  looked 
at  in  combination  with  contrast  to  see  which  other  feature  is  the  most  beneficial  when  used 
in  the  Bayesian  ANN  with  contrast.  This  continues  until  a  specified  number  of  features  is 
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TABLE  I 

Characterization  of  the  five  hand-selected  features  in  terms  of  the  properties  (t.e., 

SHAPE,  DENSITY,  TEXTURE,  SPICULATION,  AND/OR  THE  MARGINS)  EACH  FEATURE  IS  MEASURING. 


Feature 

Shape  Density  Texture  Spiculation  Margin 

RGI 

Contrast 

Margin  Strength 
Margin  RGWH  FWHM 
Margin  RGWH  Height 

X  x 

X 

X 

X 

X  X 

TABLE  II 

Characterization  of  the  features  selected  using  a  forward  selection  method. 


Feature 

Shape  Density  Texture  Spiculation  Margin 

Circularity 

Margin  Strength 

Contrast 

Margin  RGWH  Height 
Margin  CGWH  Height 

X 

X 

X 

X  X 

X  X 

selected.  Table  II  shows  the  features  that  were  automatically  selected  using  this  technique 
and  the  training  dataset.  Thirdly,  a  genetic  algorithm  (GA)  feature  selection  method 
[49,50]  was  implemented  in  order  to  select,  again,  five  features  to  be  used  in  the  Bayesian 
ANN.  The  five  GA-selected  features  are  shown  in  Table  III. 

B.  Performance 

Figure  9  shows  the  training  and  validation  dataset  ROC  curves  for  the  three  different 
subsets  of  five  features  (Tables  I,  II,  and  III)  when  trained  using  a  Bayesian  ANN  with  10 
hidden  units  for  each.  The  validation  dataset  Az  values  (Fig.  9(b))  for  the  hand-selected, 
forward-selected,  and  GA-selected  features  were  0.83,  0.88,  and  0.89,  respectively,  in  the 
task  of  distinguishing  between  actual  lesions  and  false  candidates.  The  difference  in  Az 
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TABLE  III 

A  CHARACTERIZATION  OF  THE  FEATURES  SELECTED  USING  A  GENETIC  ALGORITHM. 


Feature 

Shape  Density  Texture  Spiculation  Margin 

Circularity 

Area 

Contrast 

Grown  RGWH  Height 

Grown  CGWH  Minimum 

X 

X 

X 

X  X 

X  x 

between  the  hand-selected  and  the  forward-selected  feature  sets  on  the  validation  dataset 
was  statistically  significant  (p  <  0.01)  as  it  was  for  the  difference  in  Az  between  the  hand- 
selected  and  GA-selected  feature  sets  on  the  validation  dataset.  There  was  not  enough 
evidence  to  support  a  measurable  difference  in  Az  between  the  forward-selected  subset  of 
features  and  the  GA-selected  subset  of  features.  It  should  be  noted,  however,  that  the 
forward  selection  method  is  substantially  quicker  than  the  GA  feature  selection  method. 
The  differences  between  the  training  dataset  ROC  curves  and  the  validation  dataset  ROC 
curves  are  consistent  with  the  known  natural  bias  that  classification  systems  have  [51,52]. 

C.  Resampling  Performance 

We  previously  showed  [18]  that  it  is  unlikely  that  one  will  select  the  optimal  subset  of 
features  when  one  has  finite  data.  The  various  subsets  of  features  described  above  for 
■  use  in  the  Bayesian  ANN  are  likely  to  be  near-optimal  but  still  sub-optimal.  Thus,  if 
the  database  used  in  this  study  was  repartitioned  into  different  training  and  validation 
datasets,  different  features  and  different  classifier  performances  would  be  observered.  To 
study  this  effect,  we  repartitioned  the  original  database  by-case  into  ten  pairs  of  randomly 
selected  (by-case)  training  and  validation  datasets.  The  forward  selection  method  was 
then  applied  to  each  of  the  ten  training  datasets.  Table  IV  shows  the  the  number  of 
times  out  of  the  ten  runs  of  the  forward  feature  selection  method  that  each  feature  from 
Table  II  was  selected.  We  also  computed  the  average  training  dataset  Az  and  validation 
dataset  Az  values  which  were  0.91  ±  0.01  and  0.87  ±  0.02,  respectively.  The  average  ROC 
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Fig.  9.  (a)  Training  and  (b)  validation  performances  of  the  Bayesian  ANN  with  three  different  subsets 

of  5  features.  The  classification  task  was  to  distinguish  between  actual  lesions  and  false  candidates. 

curves  (obtained  by  averaging  the  a  and  b  ROC  curve  parameters  [53])  are  shown  in 
Fig.  10.  It  is  clear  from  Table  IV  that  a  couple  of  the  features  are  frequently  selected  by 
the  forward  selection  method  and  are,  thus,  important  for  the  classification  of  the  lesions. 
Because  other  features  are  less  often  selected  and  because  the  performance  of  the  Bayesian 
ANN  is  consistent,  these  features  are  “replaceable”  by  other  features  that  measure  similar 
characteristics. 

V.  Application  of  the  Multiobjective  Approach 

The  Bayesian  ANN  approximates  the  ideal  observer  but  typically  requires  numerous 
parameters  to  do  so  [17].  It  is  often  desirable  to  use  a  classifier  with  simple  and  under¬ 
standable  rules.  Because  simple  rules  (i.e.,  few  classifier  parameters)  are  employed,  it  is 
possible  to  incorporate  more  features  into  the  classification  system  than  would  be  possi¬ 
ble  with  an  ANN.  We  implemented  a  simple  thresholding  rule-based  classifier  in  which 
features  are  sequentially  thresholded  to  determine  the  class  of  the  candidate  lesions.  An 
example  of  a  thresholding  rule-based  classifier  would  be  to  call  a  candidate  lesion  a  ma¬ 
lignant  lesion  if  and  only  if  the  circularity  is  greater  than  0.5,  the  contrast  is  greater  than 
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TABLE  IV 

The  number  of  times  out  of  the  ten  different  forward  selection  runs  that  various 

FEATURES  WERE  SELECTED.  THE  FEATURES  SHOWN  IN  THIS  TABLE  ARE  THE  ONE  SELECTED  BY  THE 
FORWARD  FEATURE  SELECTION  METHOD  ON  THE  ORIGINAL  PARTITION  OF  THE  DATASET  TABLE 

II). 


Feature 

Number  of  times  selected 

out  of  10 

Circularity 

4 

Margin  Strength 

2 

Contrast 

7 

Margin  RGWH  Height 

9 

Margin  CGWH  Height 

1 

Fig.  10.  The  average  Bayesian  ANN  ROC  curves  for  the  10  repartitions  of  the  entire  dataset.  The 
classification  task  was  to  distinguish  between  actual  lesions  and  false  candidates  in  mammography. 
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0.2,  and  the  margin  sharpness  is  above  1.2.  If  any  of  these  conditions  are  not  met  then 
the  candidate  lesion  is  considered  a  false  detection.  We  used  the  Niched  Pareto  genetic 
algorithm  (NP-GA)  described  in  [19]  to  design  a  thresholding  rule-based  classifier  for  the 
computerized  detection  of  malignant,  mass  lesions  in  mammography.  Results  on  both  the 
training  and  validation  datasets  are  presented. 

A.  Feature  Selection 

Because  the  classifier  we  are  implementing  requires  only  thresholding  rules,  feature 
selection  is  a  simpler  task.  One  must  choose  features  that  individually  perform  well  (i.e., 
have  a  high  Az)  and  are  not  highly  correlated  with  one  another  (i.e.,  do  not  provide 
redundant  information).  Because  fewer  parameters  are  used  with  this  classifier  we  can 
likely  use  more  input  features  than  was  possible  with  the  Bayesian  ANN.  We  selected 
eight  features  to  be  used  in  the  classifier.  These  eight  Jeatures  selected  were  done  so  based 
on  expert  knowledge  (i.e.,  based  on  what  radiologists’  use  to  make  decisions),  individual 
performance  (i.e.,  Az),  and  correlations  with  other  features.  The  characteristics  of  these 
eight  features  are  listed  in  Table  V.  The  maximum  correlation  among  these  features  was 
0.74.  Note  that  this  set  of  features  is  a  superset  of  the  features  selected  by  the  forward- 
selection  method  shown  in  Table  II.  Other  similar  subsets  of  8  features  were  also  evaluated 
but  the  results  did  not  seem  to  vary  greatly.  Subsets  of  fewer  than  eight  features  were  also 
evaluated  and  found  to  perform,  on  average,  worse  than  did  the  eight  features.  Subsets  of 
more  than  eight  features  were  evaluated  and  found  to  have  similar  performances  as  well, 
thus  there  did  not  seem  to  be  any  benefit  in  adding  additional  features  beyond  eight. 

B.  NP-GA  Performance 

Figure  11  shows  the  performance  of  the  NP-GA  trained  rule- based  classifier  on  the 
training  and  validation  datasets.  The  A2  of  the  rule-based  classifier  was  0.85  on  the  train¬ 
ing  dataset  and  0.80  on  the  validation  dataset.  The  substantial  difference  in  performance 
between  the  training  and  validation  datasets  may  indicate  that  the  NP-GA  is  fine-tuning 
the  thresholds  to  such  an  extent  that  they  are  not  generalizable  when  applied  to  an  in¬ 
dependent  dataset.  However,  the  number  of  parameters  used  in  the  rule-based  classifier 
was  8  as  compared  with  the  71  used  in  the  Bayesian  ANN  (i.e.,  a  Bayesian  ANN  with  5 
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TABLE  V 

A  CHARACTERIZATION  OF  THE  FEATURES  SELECTED  FOR  USE  IN  THE  NP-GA. 


Feature 

Shape  Density  Texture  Spiculation  Margin 

Circularity 

RGI 

Contrast 

Pixel  Variation 

Margin  Strength 
Margin  RGWH  FWHM 
Margin  RGWH  Height 
Margin  CGWH  Height 

X 

X  x 

X 

X 

X 

X 

X  X 

X  X 

input  features,  10  hidden  nodes  and  1  output  node  has  a  total  of  71  parameters)  which 
was  able  to  achieve  an  Az  of  0.89  on  the  validation  dataset.  Thus,  it  is  interesting  to  note 
the  high  performance  of  the  rule-based  classifier  trained  using  the  NP-GA  and  using  very 
few  parameters. 

In  order  to  fit  the  ROC  curves  shown  in  Fig.  11,  one  must  generate  decision  variable 
data  to  be  input  into  the  ROC  curve  fitting  software.  To  accomplish  this,  the  NP-GA 
returned  ROC  operating  points,  and  the  a  priori  knowledge  of  the  numbers  of  true  and 
false  observations  in  the  dataset  are  used  to  generate  mock  decision  variable  data  which 
is  input  into  the  ROC  software  [53].  If  the  ROC  curve  returned  by  the  NP-GA  has  N 
operating  points  (including  the  (0, 0)  and  (1,1)  points  in  ROC  space),  then  one  has  N  —  1 
bins  of  decision  variable  data  to  generate  where  values  of  the  decision  variable  data  are 
1, 2, . . . ,  N  - 1 .  One  fills  in  these  bins  of  data  such  that  if  N  thresholds  are  applied  between 
the  bins,  then  the  sensitivity  and  specificity  values  of  the  ROC  curve  are  achieved.  For 
example,  if  one  has  100  true  and  100  false  detections  in  the  dataset  and  the  4th  and  5th 
■adjacent  operating  points  in  the  ROC  curve  have  sensitivities  of  90%  and  85%,  respectively, 
then  one  generates  5  (90%  -  85%  =  5%  of  100)  mock  true-positive  decision  variable  data 
with  values  of  4  (the  4th  operating  point).  If,  the  6th  operating  point  has  a  sensitivity 
of  70%,  then  one  generates  15  (85%  -  70%  =  15%  of  100)  mock  true-positive  decision 
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Fig.  11.  The  performance  of  the  NP-GA  trained  rule-based  classifier  on  the  training  and  validation 
datasets  in  the  task  of  distinguishing  between  actual  lesions  and  false  candidates. 

variable  data  with  values  of  5  (the  5th  operating  point).  A  similar  strategy  is  taken  with 
the  false-positive  decision  variable  data  as  well. 

C.  Resampling  Performance 

An  evaluation  of  the  performance  of  the  rule-based  classifier  trained  using  the  NP-GA 
using  different  dataset  partitions  was  performed.  In  this  study,  the  features  used  in  the 
classifier  were  fixed  (i.e.,  the  features  shown  in  Table  V)  and  the  NP-GA  was  rerun  10 
times  on  each  training  dataset  and  then  tested  on  each  validation  dataset.  The  average 
training  dataset  Az  was  0.87±0.02,  and  the  average  validation  dataset  Az  was  0.81  ±  0.02. 
The  average  ROC  curves  (obtained  by  averaging  the  10  a  and  b  ROC  curve  parameters 
[53])  are  shown  in  Fig.  12. 

VI.  Discussion 

The  Bayesian  ANN  can  achieve  an  Az  of  0.89  in  the  task  of  distinguishing  between 
malignant  mass  lesions  and  false  candidates  as  was  shown  in  Fig.  9(b).  Using  more  features 
with  the  rule-based  classifier  trained  using  the  NP-GA,  we  achieved  an  Az  of  0.80  in  the 
task  of  distinguishing  between  malignant  mass  lesions  and  false  candidates.  While  it  is 
generally  not  true  that  a  Bayesian  ANN  with  fewer  features  will  always  outperform  an 


September  23,  2000 


MEDICAL  PHYSICS-UNDER  REVIEW  (CONFIDENTIAL) 


21 


Fig.  12.  The  average  training  dataset  and  validation  dataset  ROC  curves  for  10  random  partitions  of 
the  entire  dataset  into  training  and  validation  datasets.  The  classification  task  was  to  distinguish 
between  actual  lesions  and  false  candidates  in  mammography. 

NP-GA  trained  rule-based  classifier,  it  is  true  for  the  classification  task  discussed  in  this 
paper. 

It  is  apparent  from  Fig.  9  that  the  Bayesian  ANN  is  not  becoming  too  specific  to  the 
training  dataset  and  the  results  are  generalizable  to  the  validation  dataset  and  other 
independent  datasets.  However,  from  Fig.  11,  one  can  see  that  the  NP-GA  rule-based 
classifier  is  becoming  too  specific  to  the  training  dataset.  The  NP-GA  is  designed  to 
produce  the  best  possible  ROC  curve  that  the  given  classifier  (a  rule-based  classifier  in 
.  this  case)  can  achieve  on  the  given  training  dataset.  Thus  the  threshold  values  can  become 
too  specific  to  the  training  dataset.  This  causes  the  NP-GA  trained  rule-based  classifier 
to  perform  poorly  on  the  validation  dataset.  Although  not  shown  in  this  work,  different 
numbers  of  features  and  different  subsets  of  features  were  applied  to  the  NP-GA  training 
with  similar  results. 

Thus  far,  we  have  shown  the  performance  of  only  the  classifier  in  the  task  of  distin¬ 
guishing  between  malignant  mass  lesions  and  false  candidates.  In  order  to  evaluate  the 
overall  mass  detection  method,  FROC  analysis  [54—56]  is  employed.  The  Bayesian  ANN 
was  chosen  as  the  final  classifier  with  the  input  features  being  those  selected  using  the 


September  23,  2000 


MEDICAL  PHYSICS-UNDER  REVIEW  (CONFIDENTIAL) 


22 


Fig.  13.  The  performance  of  the  overall  mass  detection  method  in  the  task  of  distinguishing  between 
actual  lesions  and  false  candidates  using  the  Bayesian  ANN  with  5  input  features  as  the  classifier. 

forward  selection  method  (Table  II).  If  a  candidate  lesion  was  classified  by  the  Bayesian 
ANN  as  being  a  malignant  lesion  and  that  candidate  lesion  overlapped  at  all  with  a  ra¬ 
diologist’s  outline  of  the  true  lesions,  then  that  lesion  was  considered  to  be  detected  by 
the  mass  detection  method.  We  plot  our  FROC  curve  using  the  by-case  sensitivity,  i.e.,  a 
lesion  needed  only  be  located  by  the  computer  in  one  image  to  be  considered  “detected” . 
Also,  if  there  are  more  than  5  detections  in  a  single  image,  then  only  the  5  detections  with 
the  highest  ANN  output  values  are  used  so  as  to  limit  the  total  number  of  detections  in  a 
single  image.  Figure  13  shows  the  by-case  FROC  curve  for  the  mass  detection  method  us¬ 
ing  the  Bayesian  ANN  as  the  classifier.  Using  the  Bayesian  ANN,  we  were  able  to  achieve 
a  sensitivity  of  75%  at  2  false  positives  per  image  on  the  validation  dataset. 

The  performance  of  the  mass  detection  method  characterized  in  Fig.  13  is  affected  not 
only  by  the  pattern  classification  stage  but  also  by  the  performance  of  the  initial  detection 
method,  the  ability  of  the  segmentation  algorithm  to  accurately  segment  potential  ab¬ 
normalities  from  surrounding  parenchyma  given  the  initial  detection  information,  and  the 
ability  of  the  feature  extraction  stage  to  extract  useful  information  for  the  classifier.  Thus, 
to  improve  the  performance  of  this  mass  detection  method  one  can  focus  their  attention 
on  several  aspects  of  the  method.  First,  one  could  improve  upon  the  initial  detection 
algorithm  and  design  a  method  that  returns  fewer  false  candidates  than  RGI  filtering. 
One  could  also  extract  more  useful  features  from  candidate  lesion  sites  or  provide  extra 
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features  using  previous  mammograms  or  different  views  of  the  same  breast  to  provide 
“better”  information  to  the  classifier.  Finally,  one  can  use  larger  database  sizes  and  have 
the  Bayesian  ANN  approximate  the  ideal  observer  with  more  features  (t.e.,  information) 
used  as  inputs. 

VII.  Conclusions 

We  designed  pattern  classifiers  to  distinguish  between  malignant  mass  lesions  and  false 
candidates  in  the  computerized  detection  of  mass  lesions  in  mammography  using  both 
Bayesian  ANNs  and  the  multiobjective  training  approach.  Both  methods  were  found  to 
be  practically  useful  in  distinguishing  between  malignant  mass  lesions  and  false  detections. 
However,  there  is  a  need  for  further  improvements  in  the  performance  of  the  overall  tech¬ 
nique  if  it  is  be  used  in  a  clinical  setting.  It  has  become  readily  apparent  that  one  must 
have  large  databases  to  properly  train  the  classifiers  using  as  much  useful  information 
as  possible.  With  more  useful  information  extracted  and  larger  training  databases,  the 
performance  of  the  overall  mass  detection  algorithm  is  expected  to  improve. 

Pattern  classifiers  are  widely  used  in  many  computerized  analysis  methods  for  a  wide 
variety  of  imaging  modalities.  Therefore,  the  methods  of  pattern  classification  presented 
here  can  be  applied  to  different  computerized  analysis  methods.  For  example,  the  multi¬ 
objective  approach  has  been  successfully  applied  to  the  computerized  detection  of  micro¬ 
calcifications  in  mammograms  by  Anastasio  et  al.  [57].  Bayesian  ANNs  have  recently  been 
applied  to  the  analysis  of  trabecular  bone  patterns  for  determining  the  risk  of  fracture 
using  texture  features  extracted  from  conventional  radiographs  [58],  and  to  the  characteri¬ 
zation  of  mass  lesions  as  benign  or  malignant  in  small-field  digital  mammography  systems 

[59]. 
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Computer-aided  diagnosis  has  the  potential  of  increasing  diagnostic  accuracy  by  providing  a  second 
reading  to  radiologists.  In  many  computerized  schemes,  numerous  features  can  be  extracted  to 
describe  suspect  image  regions.  A  subset  of  these  features  is  then  employed  in  a  data  classifier  to 
determine  whether  the  suspect  region  is  abnormal  or  normal.  Different  subsets  of  features  will,  in 
general,  result  in  different  classification  performances.  A  feature  selection  method  is  often  used  to 
determine  an  “optimal”  subset  of  features  to  use  with  a  particular  classifier.  A  classifier  perfor¬ 
mance  measure  (such  as  the  area  under  the  receiver  operating  characteristic  curve)  must  be  incor¬ 
porated  into  this  feature  selection  process.  With  limited  datasets,  however,  there  is  a  distribution  in 
the  classifier  performance  measure  for  a  given  classifier  and  subset  of  features.  In  this  paper,  we 
investigate  the  variation  in  the  selected  subset  of  “optimal”  features  as  compared  with  the  true 
optimal  subset  of  features  caused  by  this  distribution  of  classifier  performance.  We  consider  ex¬ 
amples  in  which  the  probability  that  the  optimal  subset  of  features  is  selected  can  be  analytically 
computed.  We  show  the  dependence  of  this  probability  on  the  dataset  sample  size,  the  total  number 
of  features  from  which  to  select,  the  number  of  features  selected,  and  the  performance  of  the  true 
optimal  subset.  Once  a  subset  of  features  has  been  selected,  the  parameters  of  the  data  classifier 
must  be  determined.  We  show  that,  with  limited  datasets  and/or  a  large  number  of  features  from 
which  to  choose,  bias  is  introduced  if  the  classifier  parameters  are  determined  using  the  same  data 
that  were  employed  to  select  the  “optimal”  subset  of  features.  ©  1999  American  Association  of 
Physicists  in  Medicine.  [S0094-2405(99)01010-X] 
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I.  INTRODUCTION 

Computerized  diagnosis  schemes  have  the  potential  of  in¬ 
creasing  diagnostic  accuracy  in  radiological  imaging.1"4  In 
computerized  schemes,  features  characterizing  suspicious 
image  regions  are  extracted  and  input  to  a  data  classifier  to 
predict  pathology.5'7  Different  combinations  of  features  will, 
in  general,  yield  different  classification  performances.  In  ad¬ 
dition,  relatively  few  features  should  be  used  in  order  for  the 
classifier  to  remain  robust.  Thus,  one  is  faced  with  the  task  of 
selecting  a  Useful  and  limited  subset  of  features  from  many 
available  features. 

In  automated  feature  selection,  a  computer  algorithm  de¬ 
termines  the  subset  of  features  that  will  result  in  the  best 
classification  performance.  Jain  et  al%  review  the  different 
types  of  feature  selection  methods,  which  they  broadly  cat¬ 
egorized  into  deterministic  methods  such  as  stepwise  feature 
selection,9  stochastic  methods  such  as  genetic  algorithm  fea¬ 
ture  selection,7,10,11  and  optimal  methods  that  are  usually 
prohibitively  time  consuming  such  as  an  exhaustive  search 
of  all  possible  feature  subset  combinations.  Feature  selection 
is  becoming  a  vital  step  in  many  computerized  schemes  due 
to  the  large  number  of  features  that  can  be  extracted  from 
suspicious  image  regions.7,10  Sample  sizes,  however,  are  of¬ 
ten  limited  for  these  studies  due  to  the  nature  of  the  prob¬ 
lems;  for  example,  abnormal  cases  are  difficult  to  obtain  in 
many  diagnostic  procedures. 

In  this  paper,  we  study  the  effect  of  limited  sample  sizes 


and  large  total  numbers  of  features  on  the  ability  of  a  feature 
selector  to  select  the  optimal  subset  of  features.  We  also 
study  the  bias  introduced  if  one  uses  the  same  data  in  select¬ 
ing  features  and  in  determining  the  parameters  of  the  classi¬ 
fier.  Section  II  contains  an  introduction  to  classifier  feature 
selection  and  introduces  the  performance  measures  em¬ 
ployed  in  this  work.  In  Sec.  Ill  the  difficulty  of  feature  se¬ 
lection  with  limited  datasets  is  demonstrated  using  analytical 
calculations  of  a  simple  feature  selection  problem.  Section 
IV  discusses  the  bias  introduced  when  the  same  dataset  is 
employed  to  determine  the  subset  of  features  and  the  param¬ 
eters  of  the  classifier.  Finally,  Sec.  V  includes  a  discussion 
of  the  implications  of  this  work  to  other  feature  selection 
methods. 

II.  BACKGROUND 

A.  Feature  selection  and  classifiers 

A  binary  classifier  takes  an  observation  of  feature  values 
and  determines  whether  the  observation  belongs  to  either  the 
abnormal  class  7ta  or  the  normal  class  Ttn.  In  order  to 
achieve  optimal  performance,  both  a  set  of  optimal  features 
and  the  optimal  parameters  (structure)  of  the  classifier 
need  to  be  determined.  The  N  features  values  corre¬ 
sponding  to  a  particular  observation  that  is  to  be  classified 
can  be  expressed  by  a  /V-dimensional  random  vector 
x=[x1,x2,...,xN]  where  bold  symbols  denote  random  vari¬ 
ables.  We  label  the  N  features  making  up  this  random  vector 
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Distribution  of  Abnormal  and  Normal  Distribution  of  Measured  Az  Values 

Feature  Values 


Fig  1  When  limited  data  are  present,  there  is  randomness  associated  with  the  measurement  of  the  performance  of  a  classifier  This  randomness  can  be 
modeled.  From  knowledge  of  the  underlying  density  functions  of  the  data  (left  graph)  and  the  number  of  samples  in  each  class,  we  can  estimate  die 
distribution  of  measured  Az  values  (right  graph).  The  density  functions  of  the  data  are  represented  by  fx(x\ rr„)  for  the  abnormal  class  and  fx(x\ rr„)  for  the 
normal  class. 


by  the  set  T.  A  sample  of  data  called  the  training  dataset, 
along  with  its  corresponding  truth  information  ( 7rfl  or  7 Tb),  is 
used  to  determine  the  parameters  of  the  classifier  or  to  select 
a  set  of  features.  Throughout  this  paper,  we  use  the  notation 
that  there  are  Sa  abnormal  cases  and  Sn  normal  cases  in  the 
training  dataset  where  Sa  +  Sn  =  S.  Often  an  independent 
dataset  with  a  total  of  S'  =  5^  +  5'  samples  is  employed  for 
testing. 

It  is  well  known  that  determining  the  parameters  of  a 
classifier  with  many  features  can  be  detrimental  due  to  the 
“curse  of  dimensionality/’  It  is  therefore  often  necessary  to 
select  a  subset  of  features  to  use  for  classification.  For  any 
given  subset  of  features  ycT,  we  can  define  a  classifier 
performance  measure  g(y)  which  measures  the  performance 
of  the  subset  of  features  y  in  the  task  of  distinguishing  be¬ 
tween  the  class  ira  and  the  class  7rn .  The  task  of  feature 
selection,  as  proposed  by  Jain  et  a/.,8  is  to  select  a  subset  of 
n  features  denoted  by  the  set  Z  from  the  entire  set  of  features 
T  such  that 

g(Z)  =  max  g(^)  (1) 

yc?\y\=n 

where  \y\  is  the  number  of  elements  in  the  set  y.  We  assume, 
without  loss  of  generality,  that  we  are  trying  to  maximize  the 
function  g(*)-  In  this  work  we  concern  ourselves  with  the 
task  of  selecting  the  optimal  subset  of  n  features  and  not  with 


the  task  of  selecting  both  the  optimal  subset  and  the  number 
of  features  within  that  subset. 

B.  Performance  measures 

Receiver  operating  characteristic  (ROC)  analysis12,13  is  an 
accepted  method  for  evaluating  the  performance  of  classifi¬ 
ers.  The  area  under  the  maximum  likelihood  estimate  of  the 
ROC  curve,  A , , 14  is  commonly  used  as  a  single  performance 
measure  to  characterize  the  performance  of  a  classifier  de¬ 
spite  the  fact  that  there  are  some  drawbacks  to  this  or  any 
scalar  classifier  performance  measure.13  In  this  work  we  in¬ 
vestigate  the  effect  of  the  distribution  of  Az  measurements 
due  to  finite  sample  size  on  the  ability  to  select  an  optimal 
subset  of  features.  In  theory,  however,  there  will  always  be  a 
distribution  associated  with  any  measure  that  employs  lim¬ 
ited  data.  The  results  presented  in  this  paper  can  thus  be 
generalized  to  different  performance  measures  such  as  the 
partial  area  index15  or  other  traditional  performance 
measures.16  Attention  is  focused  on  Az ,  however,  because  it 
is  commonly  used  within  the  diagnostic  community  to  select 
features  and  to  characterize  the  performance  of  classifiers. 

From  knowledge  of  the  theoretical  Az  and  the  numbers  of 
observations  in  each  class  (Sa  and  Srt),  we  can  estimate  the 
distribution  of  the  random  variable  Az  of  measured  Az  values 
using  a  Gaussian  centered  at  the  theoretical  Az  with  an  esti¬ 
mate  of  the  standard  deviation  given  as 


a= V5- 


A, 


2  At 


—A\ 


(2) 
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Fig.  2.  An  illustration  of  a  feature  selection  problem.  In  this  example,  one  is 
selecting  a  total  of  n  features  from  a  total  of  AT  =15  independent  features 
where  r  =  5  features  have  a  theoretical  A,  value  of  A*0  =  0.7  and  N-r 
features  have  a  theoretical  Az  value  of  Ai2)=:0.6.  We  will  consider  the 
situation  in  which  n^r.  The  distribution  of  Az  measurements  is  signified  by 
the  error  bars  on  each  feature’s  A,  value. 

Equation  (2)  was  derived  by  Hanley  et  al ,17  using  the  statis¬ 
tical  properties  of  the  Wilcoxon  statistic  to  predict  the  statis¬ 
tical  properties  of  the  area  under  an  ROC  curve.  Figure  1 
illustrates  the  distribution  of  Az  measurements  for  a  single 
feature  denoted  by  the  random  variable  x.  From  knowledge 
of  the  underlying  distribution  of  the  data  [Fig.  1(a)],  we  can 
estimate  the  distribution  of  the  measured  Az  values  [Fig. 
1(b)].  Both  the  theoretical  true  A,  and  the  sample  sizes  (Sa 
and  Sn)  determine  the  distribution  of  measured  Az  values. 
The  Gaussian  model  for  the  density  of  the  measured  Az  fails 
when  the  theoretical  Az  value  is  greater  than  approximately 
0.95.  The  theoretical  Az  values  for  all  the  simulation  studies 
performed  in  this  paper  are  below  0.95.  We  will  throughout 
this  paper  use  the  notation  that  an  estimate  of  the  area  under 
the  ROC  curve  is  given  by  an  Az  and  the  theoretical  true  area 
is  given  by  A, . 

III.  SELECTING  A  SUBSET  OF  FEATURES 

Let  us  consider  the  following  ideal  situation:  We  have  a 
total  of  N  independent  features  with  the  first  r  features  hav¬ 
ing  a  theoretical  individual  Az  value  of  A^  and  the  remain¬ 
ing  N-r  features  having  a  theoretical  individual  A,  of  A*2) 
where  Ai1)>A(,2) .  Because  the  features  in  this  situation  are 
independent,  we  conclude  that  the  random  variables  denoting 
the  N  measured  Az  values  are  also  independent  with  density 
functions  given  by  /,  (Az)  for  the  r  features  with  a  theoretical 
A,  values  of  A(zl)  and  }2{AZ)  for  the  N-r  features  with  a 
theoretical  Az  value  of  A<2) .  Similarly,  the  distribution  func¬ 
tions  are  given  by  Fj(Az)  for  the  r  features  with  a  theoretical 
A.  values  of  A[l)  and  F2(AZ)  for  the  N-r  features  with  a 
theoretical  Az  value  of  A<2).  Figure  2  illustrates  such  an 
example  where  the  error  bars  represent  the  standard  devia¬ 
tions  of  the  measured  Az  values  of  each  feature  [Eq.  (2)]. 

The  task  in  this  situation  is  to  select  the  n  features  (where 
n^r)  that  have  the  largest  measured  Az  values.  Because, 
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however,  the  measured  Az  values  have  a  distribution  associ¬ 
ated  with  them,  there  is  a  measurable  probability  that  one  or 
more  of  the  “worse”  features  (those  features  with  a  theoret¬ 
ical  Az  value  of  A<2)<A<°)  will  be  selected.  Using  order 
statistics,18’19  we  can  compute  the  probability  that  an  optimal 
subset  of  features  will  be  selected: 


r !  P 

p=7 - 7777 - 77  dAzfi(Az) 

(n-l)\(r-n)\  Jo 

X(l-F1(Az))n-1F1(AJ)r-',F2(Az)w-r,  (3) 


where  the  integration  is  from  0  to  1  because  Az  values  are 
bound  between  0  and  1.  A  derivation  of  Eq.  (3)  is  given  in 
the  Appendix.  In  theory,  the  probability  in  the  situation 
where  each  independent  feature  has  a  different  theoretical  A, 
value  could  be  computed,  but  it  is  computationally  impracti- 
cal. 

Figure  3(a)  plots  the  probability  of  an  optimal  subset  of  4 
features  being  selected  as  a  function  of  the  total  number  of 
features  to  select  from  N  [see  Eqn.  (3)].  In  this  plot  the  total 
number  of  features  with  theoretical  AZ=AZ1}  was  4,  A^  was 
set  at  0.70,  and  A<2)  was  fixed  at  0.60.  The  sample  size  5  was 
also  varied  from  100  to  1000  where  there  were  equal  num¬ 
bers  of  abnormal  and  normal  observations,  i.e.,  Sa=Sn 
=  5/2.  As  Fig.  3(a)  shows,  with  small  sample  sizes  the  prob¬ 
ability  of  selecting  an  optimal  subset  of  features  drops 
quickly  as  the  total  number  of  features  JV  increases.  Figure 
3(b)  shows  similar  plots  (n  —  4,  r=4)  but  with  higher  theo¬ 
retical  Az  values,  i.e.,  Az1)  =  0.8  and  A*2)  =  0.7.  The  differ¬ 
ence  in  theoretical  Az  values  (A<1}-A<2))  is  the  same  in 
Figs.  3(a)  and  3(b),  but  as  these  two  figures  show,  the  prob¬ 
ability  of  selecting  an  optimal  subset  of  features  depends  on 
the  theoretical  Az  values  and  not  just  the  difference  between 
the  theoretical  Az’s  of  the  “good”  and  “bad”  features.  In 
this  case,  the  probability  of  selecting  an  optimal  subset  of 
features  decreases  as  the  theoretical  Azs  decrease. 

Figure  4  shows  a  plot  analogous  to  Fig.  3(a),  except  that 
there  are  now  a  total  of  10  features  with  a  theoretical  Az 
value  of  A<1} ,  i.e.,  r=  10  instead  of  4.  The  number  of  fea¬ 
tures  to  select,  however,  is  still  4  so  we  are  analyzing  the 
probability  that  the  4  selected  features  will  be  a  subset  of  the 
10  actually  better  features.  As  Figs.  3  and  4  illustrate,  when 
there  are  more  “good”  features  in  the  population  of  features 
(r=  10  instead  of  r= 4),  the  probability  of  selecting  an  op¬ 
timal  subset  tends  to  increase. 

Figures  3  and  4  show  the  effect  of  different  theoretical  Az 
values  on  the  probability  of  selecting  an  optimal  subset  of 
features.  Figure  5  demonstrates  this  effect  in  more  detail  by 
fixing  the  sample  size  at  5=  100  and  plotting  the  probability 
of  selecting  an  optimal  subset  for  various  combinations  of 
A(zl)  and  Az2).  As  one  would  expect,  when  the  difference 
between  the"  “good”  and  “bad”  features  is  large,  it  is  easier 
to  select  an  optimal  subset  than  when  the  difference  between 
the  two  classes  of  features  is  small. 
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Fig.  3.  A  plot  of  the  probability  of  selecting  ah  optimal  subset  of  n  =4  features  from  a  total  of  N  features.  For  (a),  there  are  a  total  of  r- 4  features  with  a 
theoretical  A.  value  of  A(.n  =  0.7  and  N-r  features  with  a  theoretical  A,  value  of  Aj2)  =  0.6.  For  (b),  there  are  a  total  of  r  =  4  features  with  a  theoretical  A: 
value  of  A(l,  =  08  and  N-r  features  with  a  theoretical  A,  value  of  A<2)  =  0.7.  The  probability  of  selecting  an  optimal  subset  of  features  is  also  plotted  for 
various  sample  sizes  5.  The  difference  in  At  values  is  0.1  for  both  (a)  and  (b)  and  yet  the  probability  of  selection  the  optimal  subset  is  higher  for  (b)  than  it 
is  for  (a).  The  probability  of  selecting  the  optimal  subset  of  features  depends  of  the  theoretical  Az  values  and  not  just  the  difference  in  Az  value  between  the 
better  and  worse  features. 


IV.  CLASSIFICATION  WITH  SELECTED  FEATURES 

The  feature  selection  method  analyzed  in  the  preceding 
section  selects  the  n  features  with  the  largest  measured  Az 
values.  As  illustrated,  under  certain  realistic  conditions 
(small  sample  sizes  and  large  total  numbers  of  features),  it 
becomes  unlikely  that  this  method  will  select  an  optimal  sub¬ 
set  of  features.  The  problems  under  these  conditions,  how¬ 
ever,  are  two-fold  because  it  is  not  only  difficult  to  select  an 
optimal  subset  (as  shown  earlier),  but  data  used  to  classify 
the  features  once  a  subset  has  been  selected  will  often  not  be 
representative  of  the  underlying  density  functions  because 
feature  selection  methods,  in  general,  aim  to  select  features 
with  high  measured  Az  values.  In  essence  if  a  poor  feature 


Fig.  4.  A  plot  of  the  probability  of  selecting  an  optimal  subset  of  n  =  4 
features  from  a  total  of  N  features.  There  are  a  total  of  r=  10  features  with 
a  theoretical  Az  value  of  A^)  =  0.7  and  N-r  features  with  a  theoretical  A, 
value  of  A^2,  =  0.6.  The  probability  of  selecting  an  optimal  subset  of  features 
is  also  plotted  for  various  sample  sizes  5. 


(theoretically  poor  Az)  randomly  results  in  a  high  Az  value, 
then  that  feature  is  selected  for  use  in  the  classifier  despite 
the  fact  that  the  sample  of  data  for  this  feature  poorly  repre¬ 
sents  its  underlying  distribution.  Thus  it  is  expected  that  if 
the  same  data  employed  to  select  the  features  in  this  manner 
is  also  employed  to  determine  the  parameters  of  the  classi¬ 
fier,  then  bias  will  be  introduced  into  the  classification  pro¬ 
cess  because  the  training  data  poorly  represents  the  true  den¬ 
sity  of  the  data. 

In  order  to  analyze  this  bias,  we  simulated  N  features 
using  Gaussian  distributions,  n  of  which  had  theoretical  A, 
values  of  Al1)  =  0.68,  and  N-n  of  which  had  a  theoretical  Az 


Fig.  5.  A  plot  of  the  probability  of  selecting  an  optimal  subset  of  n=4 
features  from  a  total  of  N  features.  There  are  a  total  of  r=4  features  with  a 
theoretical  A.  value  of  A(zl)  ,N-r  features  with  a  theoretical  A,  value  of 
A<2)  and  the  sample  size  is  fixed  at  5=  100.  The  probability  of  selecting  an 
optimal  subset  of  features  is  plotted  at  various  combinations  of  A[l)  and 
A<2). 
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Fig.  6.  A  total  of  r  =  4  Gaussian  features  were  simulated  with  a  theoretical 
A.  value  of  A [l }=  0.68  and  N-r  Gaussian  features  were  simulated  with  a 
theoretical  A,  value  of  A^2,  =  0.6.  Features  were  sampled  100  different  times 
and  the  top  n  =  4  features  were  selected  based  on  the  measured  Az  values  of 
the  individual  features.  The  selected  features  were  then  merged  using  linear 
discriminants  and  the  training  dataset  and  testing  dataset  Az  values  were 
computed.  The  thin  solid  line  at  an  A,  of  0.818  is  the  theoretical  true  A,  if 
the  4  independent  Gaussian  (A,  =  0.68)  features  are  merged  using  linear 
discriminants.  The  curves  above  this  theoretical  line  are  the  average  training 
dataset  Az  values  and  the  curves  below  the  theoretical  line  are  the  average 
testing  dataset  Az  values.  The  same  data  used  to  select  the  features  was 
employed  in  determining  the  parameters  of  the  linear  discriminants.  A  sub¬ 
stantial  amount  of  bias  is  introduced  is  one  has  small  sample  sizes  and  a 
large  total  number  of  features  from  which  to  select. 


value  of  A<2)  =  0.60.  The  n  features  with  the  highest  mea¬ 
sured  Az  values  were  then  combined  using  linear  discrimi¬ 
nant  analysis  to  merge  the  n-dimensional  features  to  a  scalar 
decision  variable  representing  the  distance  from  the  separa¬ 
tion  plane  to  the  point  in  question.  The  Az  value  of  the  clas¬ 
sifier  was  computed  using,  as  input  to  the  ROC  analysis,  the 
scalar  decision  variable  data  as  output  from  the  linear  dis¬ 
criminant  classifier.  The  same  data  employed  to  select  the  n 


features  is  used  to  determine  the  parameters  of  the  linear 
discriminant  that  merges  the  n  features.  We  also  tested  the 
classifier  on  an  independent  dataset  of  1000  samples.  This 
process  was  repeated  100  times  for  each  combination  of  pa¬ 
rameters  to  obtain  an  average  training  dataset  Az  and  testing 
dataset  Az  value  for  the  classifier.  Figure  6  shows  a  plot,  for 
various  total  numbers  of  features  N,  of  the  average  training 
and  testing  dataset  Az  values  as  a  function  of  the  sample  size 
S .  The  thin  solid  line  in  Fig.  6  is  the  theoretical  true  Az  value 
of  4  independent  Gaussian  features  with  equal  variances  and 
individual  Az  values  of  0.68,  merged  using  linear  discrimi¬ 
nants.  The  curves  above  the  theoretical  line  are  the  average 
training  dataset  Az  values,  and  the  curves  below  the  theoret¬ 
ical  line  are  the  average  testing  dataset  Az  values.  As  this 
figure  shows,  bias  is  introduced  when  the  same  data  are  used 
to  select  and  merge  features.  The  bias  is  enhanced  in  situa¬ 
tions  where  there  is  little  data  and  a  large  total  number  of 
features  N\  this  is  the  same  condition  in  which  it  is  the  most 
difficult  to  select  an  optimal  subset  of  features  (see  Figs. 
3-5).  Hence,  a  suboptimal  subset  of  features  is  most  likely 
selected  and  bias  is  introduced  because  we  are  employing  the 
same  dataset  to  select  features  and  determine  the  parameters 
of  the  classifier. 

To  further  demonstrate  this,  Fig.  7  plots  a  normalized  his¬ 
togram  of  the  features  that  were  actually  selected  over  many 
trials  when  there  were  a  total  of  30  features  (N=30)  from 
which  to  select  using  sample  sizes  of  80  [Fig.  7(a)]  and  200 
[Fig.  7(b)].  The  first  four  actually  best  features  are  more 
likely  to  be  selected  than  any  one  of  the  remaining  features, 
but  the  probability  of  selecting  all  four  best  features  is  nev¬ 
ertheless  small.  With  a  larger  sample  size  the  difference  in 
the  probability  of  selecting  one  of  the  best  features  versus 
one  of  the  worse  features  is  enhanced. 

V.  DISCUSSION 

In  this  paper  we  have  only  dealt  with  a  feature  selection 
methodology  that  employs  the  performances  of  the  indi- 


<a)  ‘  (b) 

FiG.  7.  Normalized  histograms  of  the  features  actually  selected  when  r  =  4,  n- 4,  A^l)  =  0.68,  A<2)=0.6,  N=  30  and  (a)  5  =  80  and  (b)  5  =  200.  The  shaded 
bars  are  the  n  =  4  actually  better  features  with  a  theoretical  Az  value  of  0.68.  The  better  features  are  more  likely  to  be  selected  than  any  one  worse  feature  but 
the  combined  probability  that  one  of  the  worse  features  will  be  selected  is  high.  The  chances  are  better  that  one  will  select  the  optimal  first  four  features  when 
the  sample  size  is  large  (b). 
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vidual  features  to  select  the  subset  to  be  merged  using  a 
classifier.  In  this  situation  there  are  a  total  of  N  measured  Az 
values  to  analyze.  Many  other  feature  selection  algorithms 
such  as  stepwise  feature  selection9  and  genetic  algorithm  fea¬ 
ture  selection7,10,11  measure  the  performance  of  combinations 
of  features  instead  of  individual  features.  In  these  types  of 
algorithms  instead  of  having  a  total  of  N  measurements, 
there  is  a  total  of  N\fnl(N-n)\  possible  measurements. 
This  situation  is  possibly  even  more  problematic  and  more 
sensitive  to  the  distribution  of  measured  Az  values  than  the 
individual  feature  selection  method  we  have  studied,  because 
the  probability  that  a  theoretically  poor  subset  of  features 
will  yield  a  high  performance  measure  is  increased. 

We  have  used  and  merged  only  independent  features. 
More  information  is  gained  when  two  independent  features 
are  merged  instead  of  two  correlated  features.  It  should  be 
noted  that  the  optimal  classifier  for  n  independent  Gaussian 
distributed  features  with  equal  variances  is  a  linear  discrimi¬ 
nant,  which  provides  justification  for  the  method  we  have 
chosen  to  merge  our  simulated  features.  In  general,  feature 
selection  methods  that  measure  the  performances  of  combi¬ 
nations  of  features  must  take  into  account  correlations 
among  the  features.  If  the  features  in  a  similar  study  were 
correlated,  then  a  feature  selection  methodology  which  em¬ 
ploys  the  performances  of  the  individual  features  would 
clearly  be  suboptimal.  We  have  restricted  our  attention  to  the 
simpler  case  of  independent  features  here  because  it  provides 
a  basis  for  analyzing  the  difficulty  in  selecting  features  and 
merging  the  selected  features. 

We  have  chosen  the  area  under  the  maximum  likelihood 
estimate  of  the  ROC  curve,  Az,14  as  our  performance  mea¬ 
sure  in  selecting  the  subset  of  features.  Because  we  have 
limited  data,  there  is  a  distribution  associated  with  the  mea¬ 
sured  Az  values,  and  we  have  studied  the  difficulty  in  select¬ 
ing  and  merging  features  using  Az  as  the  selection  criterion; 
in  practice,  there  are  always  problems  in  representing  the 
performance  of  classifiers  by  a  single  scalar  measure.  In 
these  simulation  studies  our  features  had  equal  variances. 
Under  this  condition  there  is  theoretically  no  ambiguity  in 
comparing  A,  values  only,  because  the  population  ROC 
curves  will  never  cross  if  they  have  unequal  A,  values. 

In  practice,  however,  one  could  run  into  difficulty  when 
using  A.  as  a  performance  measure  because  two  very  differ¬ 
ent  ROC  curves  could  have  equal  A,  values.  Any  scalar  per¬ 
formance  measure,  however,  is  going  to  have  a  distribution 
associated  with  it  due  to  limited  data  and  is  going  to  have 
difficulty  with  respect  to  the  ambiguity  associated  with  rep¬ 
resenting  the  performance  of  a  classifier  by  a  scalar  quantity 
and,  hence,  the  results  presented  in  this  paper  can  be  ex¬ 
tended,  in  general,  to  different  classifier  performance  mea¬ 
sures.  Equation  (3)  is  a  general  equation,  and  similar  equa¬ 
tions  could  be  derived  for  different  performance  measures 
but  with  different  limits  of  integration  and  density  functions. 
The  datasets  employed  in  this  study  had  equal  numbers  of 
abnormal  and  normal  cases,  i.e.,  Sa=Sn .  One  can  study  the 
effect  of  unequal  abnormal  and  normal  cases  by  noting  that 
only  the  variance  in  the  measured  Az  value  changes  [see  Eq. 
(2)]. 
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VI.  CONCLUSIONS 

We  have  shown  that  when  one  has  small  sample  sizes  and 
a  large  total  number  of  features  from  which  to  select,  the 
probability  of  selecting  an  optimal  or  even  near-optimal  sub¬ 
set  of  features  is  small.  As  the  sample  size  increases  and  the 
total  number  of  features  decreases,  the  probability  of  select¬ 
ing  an  optimal  subset  approaches  1.  The  difficulties  of  fea¬ 
ture  selection,  however,  are  twofold  because  bias  is  also  in¬ 
troduced  if  one  selects  features  and  determines  the 
parameters  of  a  classifier  using  the  same  dataset.  This  bias  is 
caused  by  the  fact  that  one  is  preferentially  selecting  features 
that  misrepresent  their  underlying  density  functions.  The  bias 
is  greater  when  one  has  small  sample  sizes  and  a  large  total 
number  of  features,  the  same  situation  in  which  it  is  unlikely 
that  one  will  select  an  optimal  subset  of  features. 

The  results  in  this  work  have  dealt  with  a  fairly  simple 
and  ideal  situation  of  independent  features  that,  in  the  limit 
as  the  number  of  samples  goes  to  infinity,  are  optimally 
merged  using  linear  discriminants.  In  practice,  the  difficulties 
of  features  selection  are,  in  fact,  more  troublesome  than  pre¬ 
sented  here.  The  results  presented  can  be  extended  to  perfor¬ 
mance  measures  other  than  Az  and  inferences  can  be  made 
about  other  feature  selection  methods  that  combine  features 
instead  of  measuring  their  individual  performances. 


APPENDIX 

Assume  N  independent  observations  x,  with  a  joint  den- 1 
sity  function  fi(xi)f2(x2)—fN(xN)  and  joint  distribution 
function  Fl(xl)F2(x2)’ "FN(xN).  Define  the  set  of  all  ob¬ 
servations  as  $—  { 1,2,. ..,Af}  and  let  S  be  partitioned  into  two 
subsets  B  and  W.  We  will  now  focus  attention  on  one  ele¬ 
ment  of  the  partition  B ,  which  will  be  labeled  x b .  The  prob¬ 
ability  (i)  that  xb  falls  between  x  and  x  +  dx  is  fb(x)dx ,  (ii) 
that  all  remaining  observation  in  B  are  larger  than  x  is 
n  B  .^(1  and  (iii)  that  all  observations  in  W  are 

less  than  x  is  n.eVVF;(x).  The  probability  that  (i),  (ii),  and 
(iii)  all  occur  simultaneously  is 


Mx)dx  n  fm  n  0 -fm).  (ad 

ieW  i  e 

Integrating  the  above  expression  over  all  x  arrives  at  the  total 
probability  that  the  observations  in  W  are  less  than  observa¬ 
tion  xb  which  is  less  than  all  other  observations  in  B.  By 
summing  over  all  of  the  elements  in  B,  one  arrives  at  the 
probability  that  all  of  the  observations  in  B  are  larger  than  all 
of  the  observations  in  W,  i.e., 

P(xtB)>X{W))=  2 


xn  Ffa)  n  (i-f.w) 

ie  W  ieB,i*j 


(A2) 


If  the  n  observations  of  B  are  sampled  from  the  same  density 
function  f\{x),  r-n  of  the  observations  in  W  are  also 
sampled  from  /,(x),  and  the  remaining  N-r  observations 
have  common  density  /2(x),  then  Eq.  (A2)  simplifies  to 
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PU{5}>X{w})  =  n  J  dxfi(x) 

XFl(xY-nF2{x)N~V -Fi(x))n~l 2 3.  (A3) 

The  above  expression  assumes  that  the  observations  in  B  are 
a  specific  subset  of  the  r  observations  with  density  fx{x). 
There  are,  however,  a  total  of  r\!n\(r-n)\  possible  subsets 
that  contain  n  observations  with  density  f\(x).  Multiplying 
Eq.  (A3)  by  this  fraction,  we  arrive  at  the  probability  that  the 
n  largest  observations  all  come  from  the  density  function 
/iW. 

7 - - 77  f  dxfMF^xy-" 

(n  —  1  )!(r— «)!  J 

XF2(x)N~r(l  —  Fx(x))n~1.  (A4) 

If  our  observations  x  are  measured  Az  values,  then  the  above 
expression  becomes  Eq.  (3). 
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